On ELF, Part 2

01 Feb 2018

Abstract

The Executable and Linkable Format, ELF, is too complex. I firmly feel it was created without critical thought about the ramifications it would have on future toolchains. In this article, I show how ELF-like features can be safely retrofitted onto executable formats contemporary with ELF’s debut. Hopefully, future executable format authors will consider their needs more critically before committing to something as complicated as ELF.

Basic Principles

Before writing this installment, I had a realization. One of the big mistakes from Unix, besides the X Window System, was the ld linker. At the time, it must have seemed like a great idea! Put yourself in the original authors’ shoes: why should the kernel spend so much effort relocating a binary when that image will just appear at the same place in every process? So, why not relocate any linked executable at link time, before the kernel ever gets a chance to see it? The kernel can be made much simpler (just read in an opaque binary blob, then call a specified address), and as a nice side-effect, it’s faster to load programs as well, which was probably noticeable on timesharing systems of that era.

But, as we’ve seen in the previous article, this can open a can of worms from the security point of view as well as constrain new features, such as the new hotness at about this time, component-oriented programming (what eventually led to CORBA and DCOM). It turns out that program relocation at load-time is a good thing.1 The design of ELF, in part, reflects the realization that mistakes were made in the past, and AT&T wanted to both move forward with new features while preserving at least some of their initial investment. (Aside: I, personally, have mixed feelings about this; but, this article is not about those specific feelings.)

So, why is ld such a big mistake? Bluntly, it’s a tool which, like ELF, fails to adhere to the Unix philosophy of doing one thing well. See, ld is responsible for at least two services:

  • Linking multiple compiler artifacts into a single artifact for later re-use in the construction of software, and,
  • Producing a loadable binary from one or more of these re-usable artifacts.

While these two concepts are clearly related, they are most definitely not the same thing. We see that the Plan 9 compiler toolchain partially remedies this past oversight by keeping software in a binary representation of an assembly language listing for as long as possible before the very last step of producing the final binary executable.

Basically, there are two principles involved.

Principle of Linking. A linker’s job is to coalesce, merge, and perhaps even sort sections of equivalent type. If it helps you, think of it as a merge-sort for related code. A typical program might have hundreds of sections, which a linker might reduce to a smaller, perhaps still quite numerous, quantity. Ultimately, by the time you yield a final executable, it’s been massaged down to a much smaller set (typically, text, data, and BSS).

Principle of Loading. A loader’s job is to unmarshal a concrete program representation into memory from its off-line representation. This prepares a program for execution. In the case of dynamic linking, a loader may perform some basic relocations, but it never merges sections of comparable types. It just lays them out in the address space somehow. This can be seen on any Linux installation by looking at a process’s memory map layout (sudo cat /proc/$PID/maps, which will show each dynamically loaded module’s text, data, and BSS segments, along with the process’s stack). I’m sure BSD and Plan 9 have similarly accessible methods of showing this information.

Where ld goes wrong, then, is that it attempts to both link and load in one step. The limitations of this approach do not become visible as long as

  • Your runtime environment is isolated via a page-capable MMU, and,
  • You have no need whatsoever for dynamic linking.

Violate either of these assumptions, and ld’s approach breaks down hard. This is why ld requires so many horrible-looking arguments when linking code together: one set of options exists to choke ld off from building an executable or shared object when all you want is a relocatable module, while another set exists to do just the opposite.

I mention this only because I want this article to focus exclusively on program loading. That is, nothing I write in this article is intended to relate to linking at all. The formats below may or may not make suitable targets for linkers. In fact, there’s a great chance that they won’t at all. Plan 9 accepts this. There’s no reason why the progeny of Unix can’t either.

So, with the full understanding that we’re talking exclusively about loading programs, let us now tear ELF a new hole in its address space by illustrating viable alternatives to it.

ELF versus GEMDOS PRG Format

In the previous article, I had stated that the a.out format was one of the earliest and most portable of executable formats around. To illustrate this point, let’s compare Unix’s original a.out format to an OS where you might not have expected it to show up: Atari TOS and GEMDOS, the two components that make up the Atari ST/TT’s operating system.

PDP-11 Unix

struct aout_unix {
    short   a_magic;    // 0x0105 ("br .+12": branch over this header)
    short   a_textsz;   // Size of .text segment
    short   a_symsz;    // Size of symbol tables
    short   a_relocsz;  // Size of fixups
    short   a_datasz;   // Size of .data segment
    short   a_padding;
};

Atari TOS

typedef struct
{
   WORD  ph_branch;        /* Branch to start of the program  */
                           /* (must be 0x601a!)               */

   LONG  ph_tlen;          /* Length of the TEXT segment      */
   LONG  ph_dlen;          /* Length of the DATA segment      */
   LONG  ph_blen;          /* Length of the BSS segment       */
   LONG  ph_slen;          /* Length of the symbol table      */
   LONG  ph_res1;          /* Reserved, should be 0;          */
                           /* Required by PureC               */
   LONG  ph_prgflags;      /* Program flags                   */
   WORD  ph_absflag;       /* 0 = Relocation info present     */
} PH;

Some observations:

  • The respective fields appear in different places depending on which header is used. But, that’s OK; ELF 32-bit and 64-bit structures also have fields which move about to support different processor alignment restrictions.
  • TOS supports an explicit .bss segment, while the original Unix loader did not. BSS was just a pre-zeroed bunch of bytes tacked onto the end of .data, and accounted for in a_datasz.
  • Interestingly, both the PDP-11 and the 68000 formats supported fixups to resolve code and data references. This suggests there was a time when Unix did not run with a page-based MMU.

After the file header, you’ll find the actual executable artifacts:

+-------------------------------+
| PH header                     |
+-------------------------------+
| TEXT segment                  |
+-------------------------------+
| DATA segment                  |
+-------------------------------+
| Symbol table                  |
+-------------------------------+
| Relocation table              |
+-------------------------------+

That’s it: there are no tables containing pointers to segments, there are no complex graphs to traverse to get at some piece of information, there exists no reason to invest more than a handful of printed pages to describe the entire structure. Despite this minimalism, the a.out format is surprisingly versatile. It supported TOS applications in a single address space, it supported MultiTOS applications in a single address space, and it later supported MultiTOS applications in separate address spaces as well. Clearly, we see that a.out is quite adept at handling both DOS-like and Unix-like environments with equal facility.

Note that some details differ between the two a.out headers. That’s perfectly OK. Exact layout doesn’t matter; what matters are the concepts supported. The header gives the loader just enough information to allocate a chunk of memory for code, a chunk of memory for data and BSS, to locate the relevant data in the file and load it into the allocated memory, and then to apply fixups to resolve broken references. The a.out binaries were otherwise self-contained and self-consistent: it was not possible to legally create an a.out file which referred to an undefined symbol. Intra-segment relocations (for instance, where a symbol in module A’s .text segment refers to a symbol in the .text segment of module B) are resolved by the linker prior to emitting the final a.out, often relying upon PC-relative or register-relative addressing. Absolute references (e.g., with JSR or JMP instructions, or when storage in .data was referenced from a location in .text) were resolved using the bundled fixups. The reason for this is that there was little to no guarantee that .data and .text would remain adjacent in the computer’s address space, particularly in single address space environments.

In my previous article, I’d stated that a.out was intended to be loaded/mapped blindly as an opaque blob. This is especially evident in the original Unix header, as the “magic” field, used to distinguish executables from other file types, is literally a PDP-11 machine language instruction that jumps over the header. The Atari TOS program header continues this tradition.

Extending to Support Dynamic Linking

OK, so now that we’ve seen how trivial a.out is, how do we extend it to support dynamic linking? Pretend, for a moment, that you were in charge of adding dynamic linking to MultiTOS for its next release. How would you retrofit this loader format to do so? Here is how I would handle the task.

Required Extensions

The first task would be to identify what additional support is required to handle the job.

First and foremost, a statically linked binary is expected to be self-contained. Dynamic linking throws that assumption out the door; therefore, we need a way for a module to identify its dependencies. So, we need to introduce a segment that lists library dependencies. Let’s not concern ourselves with precise layout at this point; we only need to know that it consumes space in the file, and thus, has a size field associated with it.
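
To make this concrete, here is a minimal sketch of one possible layout: NUL-terminated library names packed back to back, with the segment’s size field (phx_deplen, defined below) giving the total byte count. The layout, the function, and the example library name are my own invention for illustration; nothing here comes from a real TOS interface.

#include <stdio.h>
#include <string.h>

/* Walk a dependency segment laid out as packed, NUL-terminated names.
   This layout is an assumption made for illustration only. */
static void walk_dependencies(const char *dep, long deplen)
{
    const char *end = dep + deplen;

    while (dep < end && *dep != '\0') {
        printf("depends on: %s\n", dep);   /* e.g., "vdi.slb" (hypothetical) */
        dep += strlen(dep) + 1;            /* skip the name and its NUL */
    }
}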

Second, we observe that dynamically linked modules may require symbols defined in the executable. It turns out a.out has us covered here, as both PDP-11 and TOS versions of the file support a symbol table explicitly. Convenient! So out-bound definitions are taken care of; however, in-bound references are not. There are two ways to handle this: we can extend the existing symbol table definition to support both symbol definitions and symbol references, or we can introduce a separate symbol table segment just for in-bound references. Just to make things harder and for the sake of illustration, we’ll assume the latter. In the real world, I’d probably shoot for the former.
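
For illustration only, an entry in such an in-bound symbol table might mirror the existing symbol table’s fixed-width entries, with a value field the loader fills in at resolution time. These field names are mine, not part of any real MultiTOS structure:

struct import_entry {
    char  i_name[8];    /* symbol name, NUL-padded if shorter */
    WORD  i_type;       /* what kind of reference this is */
    LONG  i_value;      /* resolved address, filled in by the loader */
};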

To support in-bound symbol relocations, we will also need to extend our relocation records. Atari TOS uses a very simplistic model: every N bytes, add the base address of the appropriate segment to the 32-bit word at that current location. Again, the assumption is that references are self-consistent/self-contained, so this is fine if all you’re doing is relocating a microcosm of code. For dynamic linking, however, we need to know not only where to perform the relocations, but also against what symbol. For this reason, I would replace individual bytes in the relocation segment with 32-bit words, with a layout along these lines:

 3 2 2         0
 1 4 3         0
+---+-----------+
| D |     d     |  Type 1
+---+-----------+

+---+-----------+
| D |     S     |  Type 2
+---+-----------+

This makes for a larger executable file, but is also substantially more flexible. The rules are as follows:

  • If D=255, then we have a type-1 relocation record, which basically says, “There’s no relocation at this point; however, the next relocation record is going to apply d bytes from here in the file.”
  • If D<255, then we have a type-2 relocation record. This says to the loader, “Fix up the current 32-bit word with the value of the S-th symbol in the symbol table, then advance D (capital!) bytes in the program image.”

In both cases, we know if we’re making a code or data segment reference based both on the value of the referenced symbol, and on the relative offset we’re patching in the a.out file.
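
A small C sketch of decoding these records, following the encoding proposed above (D in bits 31-24, the low 24 bits carrying either d or S); the helper names are mine:

#include <stdint.h>

#define RELOC_SKIP 0xFFu   /* D == 255 marks a type-1 record */

static int is_type2(uint32_t r)        { return (r >> 24) != RELOC_SKIP; }
static uint32_t sym_index(uint32_t r)  { return r & 0x00FFFFFFu; }   /* S */

/* Bytes to advance after this record: type-1 records advance by their
   24-bit d field; type-2 records advance by D itself. */
static uint32_t displacement(uint32_t r)
{
    return is_type2(r) ? (r >> 24) : (r & 0x00FFFFFFu);
}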

So far as I’m aware, this is all that is required to make the format support dynamic linking. Now let’s lay this stuff out into something that makes the task of loading it easy.

Refined Header Structures

Since the initial ph_branch is a 68000 BRA instruction that skips over the length of the program header, it follows that if we append our extra fields onto the program header, then we must adjust the branch offset as well. This implies we create a new magic cookie for ph_branch which the loader can use to determine if the binary is dynamically linked or not.

Just in case some other file type uses this same magic value, we’re going to use an unused ph_prgflags bit to identify a dynamically linked artifact. Let’s call it PRGF_DYNAMIC.

If ph_branch is correct yet PRGF_DYNAMIC is not set, then we either have a corrupt program header, or the file is not actually a dynamically linked executable. Thus, we reject the file.

The following fields can appear immediately after the ABSFLAG field:

typedef struct
{
   WORD  ph_branch;         // Magic: 0x6022
   LONG  ph_tlen;           // as before.
   LONG  ph_dlen;           // as before.
   LONG  ph_blen;           // as before.
   LONG  ph_slen;           // as before.
   LONG  ph_res1;           // as before.
   LONG  ph_prgflags;       // Make sure PRGF_DYNAMIC is set!
   WORD  ph_absflag;        // as before.
   LONG  phx_deplen;        // Length of dependencies.
   LONG  phx_implen;        // Length of imported symbols.
} PHDYN;

Observe that loading the module’s code and data segments is meaningless if we can determine ahead of time that we cannot satisfy this module’s dependencies. For this reason, we need only insert our list of dependencies and symbol imports, in that order, ahead of the code segment. In this way, you can still compute the offsets of everything without having to provide explicit pointers. This keeps the size of the header down. We also take care to order the segments we’ll need in the order we’ll use them.

  • We first find exported symbols, since it’s possible that dependencies may require a symbol defined by the executable itself. So, the loader must process locally defined symbols first.
  • Then, we transitively load dependencies using a depth-first traversal of the dependency tree.
  • Once all dependencies have been loaded, then we can resolve imported symbols for this module. If any are left unresolved, then we abort further loading with an undefined symbol error.
  • At this point, we can finally handle loading our TEXT, DATA, and BSS segments, along with any relocations they may require.

By doing things in this order, we minimize unnecessary seeking through the file.

+-------------------------------+
| PHDYN header                  |
+-------------------------------+
| exported symbols (new)        |
| (replaces old symbol table)   |
+-------------------------------+
| dependencies (new)            |
+-------------------------------+
| imported symbols (new)        |
+-------------------------------+
| TEXT segment                  |
+-------------------------------+
| DATA segment                  |
+-------------------------------+
| Relocation table (new)        |
| (replaces old reloc table)    |
+-------------------------------+

Normative Procedures

Here, we get to the basic algorithm of loading and relocating a dynamic executable using this format.

You’ll notice a reference to a global context. This is some arbitrary data structure that serves as a global record-keeping device, tracking resources allocated and files opened. In Unix, this would be the process itself. For something like TOS, this structure would need to exist separately. Either way, it’s required if we want to properly track resources in the event we return with an error. Otherwise, resources would be allocated with no way to free them, since we wouldn’t have a long-standing reference to them. The context also serves as a rendezvous point for symbols: where they’re defined, and what value they equate to.
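
One possible shape for such a context, sketched in C; all of the field names here are mine, since the text only requires that the context track resources for unwinding and act as a symbol rendezvous point:

struct load_context {
    struct module   *modules;    /* every module loaded so far */
    struct symbol   *symbols;    /* rendezvous: name -> value mappings */
    struct resource *resources;  /* memory blocks and open files, so a
                                    failed load can be unwound cleanly */
};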

First, we start out at the top level: when loading a PRG, we dispatch to the appropriate loader based on ph_branch values.

PROCEDURE LoadPRG(filename)
BEGIN
    Read first two bytes of file.
    IF ph_branch == 0x601A THEN
        Load statically linked PRG file.
    ELSIF ph_branch == 0x6022 THEN
        Create global loader context.
        LoadDynamicPRG(filename, context)
    ELSE
        Return with an unrecognized file type error.
    END
END

The legacy loader, as defined in GEMDOS, can be used to load the statically-linked executable, so I won’t repeat its pseudo-code here. Instead, let’s focus on the dynamic loader.

PROCEDURE LoadDynamicPRG(filename, context)
BEGIN
    Read rest of the PHDYN header.
    IF PRGF_DYNAMIC not in ph_prgflags THEN
        Return with an unrecognized file type error.
    END

    // Process locally defined symbols so that any dependencies
    // can find them and use them.

    FOR ALL s IN locally defined symbols DO
        Create symbol descriptor for s in the context.
    END

    // Transitively load all dependencies.

    FOR ALL d IN dependencies DO
        Resolve dependency d to library filename.
        LoadDynamicPRG(Filename(d), context)
        IF not successful THEN
            Free all allocated or opened resources so far.
            Return with an error code.
        END
    END

    // We've just loaded all of our dependencies, and they are
    // happy with the symbols we've exported.  Let's now resolve
    // our own symbol imports.  This can potentially take some
    // amount of time; O(n^2) algorithm.

    FOR ALL s IN imported symbols DO
        FOR all dependencies we loaded above DO
            IF dependency contains symbol THEN
                Create descriptor for s in the context.
                Exit this loop and continue outer loop.
            END
        END
        // No dependency claims to define the symbol.
        // Abort with an undefined symbol error.
        Free all allocated or opened resources so far.
        Return with an error code.
    END

    // We transitively loaded our dependencies.  We've resolved
    // all of our symbols.  Now it's time to load our text and data
    // segments and apply relocations to them.

    AllocateTextSegment()
    ReadTextSegment()
    AllocateDataSegment()
    ReadDataSegment()
    AllocateBssSegment()
    ZeroBssSegment()

    offset = 0
    segment = TextSegment
    IF length(relocations) > 0 THEN
        FOR ALL r IN relocations DO
            IF
                offset >= length(TextSegment) AND
                segment = TextSegment
            THEN
                segment = DataSegment
                offset = offset - length(TextSegment)
            END
            IF
                offset >= length(DataSegment) AND
                segment = DataSegment
            THEN
                segment = BssSegment
                offset = offset - length(DataSegment)
            END
            IF IsType2(r) THEN
                x = GetWord(segment+offset)
                v = SymbolValue(SymbolDescriptorFor(r))
                PutWord(segment+offset, x+v)
            END
            offset = offset + displacement(r)
        END
    END
END

This pseudo-code is more or less representative of the actions that must be taken for someone to extend GEM’s PRG format to include dynamic linking abilities. Actual code is straight-forward: it follows the layout of the data as found in the executable file, and should be easily maintained as future requirements dictate.

This pseudo-code does not handle the case where a shared library is already resident in memory. However, with a small change to the calling convention of the LoadDynamicPRG procedure, it should be possible to short-circuit the loading process for resident libraries.
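
Sketching that change in C: give the context a registry of loaded modules keyed by filename, and consult it before touching the file system. find_module, load_fresh, and the refcount field are all hypothetical:

struct load_context;
struct module { int refcount; /* segments, symbols, ... */ };

struct module *find_module(struct load_context *ctx, const char *filename);
struct module *load_fresh(struct load_context *ctx, const char *filename);

struct module *load_dynamic_prg(struct load_context *ctx, const char *filename)
{
    struct module *m = find_module(ctx, filename);

    if (m != NULL) {
        m->refcount++;     /* already resident: share its segments */
        return m;
    }
    return load_fresh(ctx, filename);   /* the full procedure above */
}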

Handling Other Processor Architectures

The pseudo-code above assumes the same 680x0 processor family that drives the Atari ST/TT line. As written, it won’t scale to newer architectures. Many RISC processors require multiple kinds of relocations to resolve references, since a single instruction rarely has room for a full word’s worth of address bits. For example, a RISC-V processor can only address +/- 2KiB from some base address in a single instruction. For a 32-bit address, then, you need to break the reference up across two instructions: a LUI instruction to load the upper 20 bits of the address, and an addition to contribute the lower 12 bits. Worse still, the 12-bit addend is sign-extended, which must be accounted for in the upper portion of the address.

A 64-bit RISC-V implementation can’t even use this approach, for there is no way to embed the upper 32 bits of an address directly into the instruction stream. You end up loading addresses indirectly from a code or data word of memory, and that is what receives the relocation. Alternatively, if the value has lots of binary 0s in it, you can load a portion of the address in the low-order bits and then shift up appropriately. As you can see, resolving a 64-bit address on RISC-V can potentially be messy.

Assuming a 64-bit RISC-V target, the loader will need to understand how to perform a minimum of four different kinds of relocations:

  • Absolute 32-bit relocation
  • Absolute 64-bit relocation
  • Upper 20-bits for LUI
  • Lower 12-bits for ADDI

This can be encoded easily enough in the relocation segment of the file. You just need to make the relocation record format known to the loader somehow, either implicitly through the ph_branch magic cookie, or via additional header fields.
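
As a sketch, here is how a loader might apply those four relocation kinds in C. The +0x800 rounding accounts for the sign-extended low half: when bit 11 of the address is set, the ADDI effectively subtracts from the LUI value, so the upper 20 bits must be rounded up first. The function names are mine; the bit positions are those of RISC-V’s LUI and ADDI encodings.

#include <stdint.h>

static void fixup_abs32(uint32_t *p, uint64_t a) { *p = (uint32_t)a; }
static void fixup_abs64(uint64_t *p, uint64_t a) { *p = a; }

static void fixup_hi20(uint32_t *insn, uint64_t a)
{
    /* LUI holds imm[31:12] in instruction bits 31..12. */
    uint32_t hi = (uint32_t)((a + 0x800) & 0xFFFFF000u);
    *insn = (*insn & 0x00000FFFu) | hi;
}

static void fixup_lo12(uint32_t *insn, uint64_t a)
{
    /* ADDI holds imm[11:0] in instruction bits 31..20. */
    uint32_t lo = (uint32_t)(a & 0xFFFu);
    *insn = (*insn & 0x000FFFFFu) | (lo << 20);
}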

Handling Initializations and Finalizations

Some languages, like Modula-2 and C++, often require objects to be pre-initialized via “constructors” before the main program begins. In the case of Modula-2 and Oberon, the “constructor” is the global body of a module definition. In C++, D, Sather, et al., these are provided through more explicit programming constructs on classes. Either way, they must be invoked for the main program to run correctly. After all, the executable might be written in plain-vanilla C, while the library you’re depending upon could very well be written in C++!

In a statically linked executable, this is not a problem, since the linker is given libraries which are capable of handling this issue. It has compile-time (at best) or link-time (at worst) knowledge of which routines to call and when. Thus, it can provide supporting code in the .text segment, and the day is saved.

For dynamically linked executables, however, it’s not as simple. Since the linker and loader both have no idea what languages modules are written in, the loader must support a means of calling constructors independently of the main program’s entry point. The ELF .init, .fini, and related sections all exist to support languages which require global constructors to be called before invoking the main program.

We map this to the a.out file by requiring initializers and finalizers to exist in the .text segment. Further, instead of providing a single routine to call, we express requirements as a vector of routines to call. These can be handled by extending the a.out header once more with the following fields:

LONG    phx_ivecbase;   // Offset to base of initializer vector
LONG    phx_ivecsize;   // Number of initializers to call
LONG    phx_fvecbase;   // Offset to base of finalizer vector
LONG    phx_fvecsize;   // Number of finalizers to call

From the loader’s point of view, there’s nothing special about these vectors. The linker must provide zero or more such vector entries for each module it helps produce. The loader, then, must take on the responsibility of invoking them as it loads module dependencies.
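
A sketch of the loader’s side of this, assuming phx_ivecbase is an offset into the module’s already loaded and relocated TEXT segment, and that each vector entry is simply a function pointer:

#include <stdint.h>

typedef void (*init_fn)(void);

/* Run a module's initializer vector; the finalizer vector would be
   walked the same way at unload time, presumably in reverse order. */
static void run_initializers(uint8_t *text, long ivecbase, long ivecsize)
{
    init_fn *vec = (init_fn *)(text + ivecbase);

    for (long i = 0; i < ivecsize; i++)
        vec[i]();
}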

ELF versus Hunk Format

First, I want to get this off my chest. There is no such thing as the Amiga Hunk Format. It’s just … Hunk Format. Remember, AmigaDOS is just a port of the Tripos operating system to run under the exec.library kernel. You can find more about Commodore-Amiga’s rebranding of Tripos at this Page Table article. I highly recommend it; it’s actually quite fascinating.

Now that that’s cleared up, let’s talk about hunks. Big hunks, heavy hunks, small hunks, hot hunks, however you prefer them to be. A hunk is nothing more than a self-identifying, often explicitly length-delimited, array of integers. Like so:

+-----------------------------+
| Hunk Type ID                |
+-----------------------------+
| # of words (4 in this case) |
+-----------------------------+
| 1                           |
+-----------------------------+
| 2                           |
+-----------------------------+
| 3                           |
+-----------------------------+
| 4                           |
+-----------------------------+

In the Amiga’s case, these words are always, always, 32 bits wide. That’s because dos.library (a.k.a. Tripos) was written in a 32-bit dialect of BCPL. In the case of STS, a 16-bit OS of my own design, these words were always 16 bits wide. For a 64-bit RISC-V port, they could well be 64 bits wide. I think you get the idea.

A hunk formatted file, then, is just a file with an array of hunks in it. That’s it. You now understand the structure of a hunk formatted file.

You’ll notice that there’s no mention of text, data, or BSS segments. No relocations. No symbol definitions or imports. That’s because hunk format is a container format, just as MP3 is a container for audio, MP4 for different kinds of video, etc.

Next step, then, is to learn how dos.library and Tripos actually used it to store executable artifacts.

Amiga and Tripos Binary Files

The Hunk Format executable is naturally multi-processor capable. Metacomco defined it for the Motorola 68000 processor originally, the details of which you can read in the Amiga Binary File Specification. Since then, however, the Amiga community moved to support the PowerPC processor architecture. Although modern versions of AmigaOS support ELF, the community originally just extended Hunk Format with PowerPC-specific hunks.

So, right out of the gate, we plainly see hunk format supports multiple processors. The fact that I used the basic concepts in STS also illustrates that hunk format is not constrained to just AmigaOS. So multi-processor and platform independence have never been unique selling points for ELF. And we haven’t even begun to look at the details of hunk format yet.

For the purposes of this article, however, I will once again concentrate just on the 68000 version of the file.

The large-scale structure of an Amiga executable looks more or less like this:

+---------------+
| HUNK_HEADER   |
+---------------+
| HUNK_CODE     |
+---------------+
| HUNK_RELOC*   |
+---------------+
| HUNK_DATA     |
+---------------+
| HUNK_RELOC*   |
+---------------+
| HUNK_BSS      |
+---------------+
| HUNK_END      |
+---------------+

The layout of a hunk file is intended to be processed with an “event-driven” parser, which looks more or less like this skeleton pseudo-code:

f = fopen("hunkfile", "rb");
done = FALSE;
do {
    size = fread(&id, 4, 1, f);
    if(size < 1) break;

    switch(id) {
    case HUNK_HEADER: ...
    case HUNK_CODE: ...
    case HUNK_DATA: ...
    case HUNK_BSS: ...
    ...etc...
    case HUNK_END:
        done = TRUE;
        break;
    default: // unknown hunk type?  OK, skip it.
        fread(&size, 4, 1, f);      // hunk length, in 32-bit words
        fseek(f, size*4, SEEK_CUR);
        break;
    }
} while(!done);
fclose(f);

As you can see, the very structure of a hunk file is intended to simplify the loader a great deal.

Header Hunk

HUNK_HEADER hunks identify the file as an executable and give the loader due warning of the number of segments to allocate and how big they are. But, hark, look what else this hunk provides!

+-----------------+
| HUNK_HEADER     |
+-----------------+ -+
| N1              |  |  Names of shared library
+-----------------+  |  dependencies.
| N1 name words   |  |
+-----------------+  |
| N2              |  |
+-----------------+  |
| N2 name words   |  |
+-----------------+  |
|       ...       |  |
+-----------------+ -+
|        0        |
+-----------------+
| Hunk Table Size |
+-----------------+
| First hunk      |
+-----------------+
| Last hunk       |
+-----------------+
|                 |
| L-F+1 Sizes     |
|                 |
+-----------------+

This hunk includes the names of shared library dependencies. The loader is responsible for opening (and loading if not already in memory) each specified library prior to loading the rest of the binary. In this way, transitive dependencies are satisfied.

Assuming this has been done successfully, the next step is to allocate an array of segment descriptors, whose size is specified by the hunk table size field. The loader then copies the libraries’ hunk descriptors into this image’s descriptor table. From the loader’s perspective, all your shared libraries have been included in this image’s hunk table as if the contents of those libraries came from this executable file. This greatly eases relocating hunks without having to traverse a complex graph of dependencies.

(Aside: at this point, if you wanted to, you could theoretically deallocate individual library segment tables, since the relevant references are now included in this executable’s segment table. But, it’s not worth it; if these are reusable libraries, they’ll be referenced again later.)

Once all the libraries have been opened and their segment tables merged into this executable’s segment table, then the fun part happens: the loader reads the remainder of the hunk to get at the list of hunk sizes. This gives the loader an opportunity to pre-allocate all the segments it needs to hold the binary in memory. These segments will take their place sequentially in the hunk table starting at the First hunk index, and continuing until the Last hunk index, inclusive.
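
In C, the header processing just described might look something like this sketch; struct segment, build_table, and the parameter names are mine:

#include <stdint.h>
#include <stdlib.h>

struct segment {
    void     *base;    /* allocated memory for this hunk */
    uint32_t  words;   /* size in 32-bit words */
};

/* Library segments (already loaded) fill the low slots; this file's
   own hunks, first..last inclusive, are pre-allocated from the size
   list at the end of the header. */
static struct segment *build_table(const struct segment *libsegs, uint32_t nlib,
                                   uint32_t first, uint32_t last,
                                   const uint32_t *sizes, uint32_t table_size)
{
    struct segment *tab = calloc(table_size, sizeof *tab);

    for (uint32_t i = 0; i < nlib; i++)
        tab[i] = libsegs[i];                     /* share library segments */
    for (uint32_t i = first; i <= last; i++) {
        tab[i].words = sizes[i - first];
        tab[i].base  = calloc(tab[i].words, 4);  /* BSS comes pre-zeroed */
    }
    return tab;
}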

So, if we have a clock executable that includes (for sake of argument) libc.so and libX11.so, then the loader’s segment table could well look a lot like this:

    +----------------+
0   | libc.so CODE   |---->
    +----------------+          points into
1   | libc.so DATA   |---->     libc.so's
    +----------------+          segments
2   | libc.so BSS    |---->
    +----------------+
3   | libX11.so CODE |---->
    +----------------+          points into
4   | libX11.so DATA |---->     libX11.so's
    +----------------+          segments
5   | libX11.so BSS  |---->
    +----------------+
6   | clock CODE     |---->
    +----------------+          allocated,
7   | clock DATA     |---->     but not yet
    +----------------+          loaded.
8   | clock BSS      |---->
    +----------------+

Gosh, this looks an awful lot like the information made accessible in Linux /proc/$PID/maps files, wouldn’t you agree?

The actual HUNK_HEADER will look vaguely like this in the file:

+-----------------+
| HUNK_HEADER     |
+-----------------+
| 2               | Shared object dependencies
+-----------------+ start here.
| 'l' 'i' 'b' 'c' |
| '.' 's' 'o'  0  |
+-----------------+
| 3               |
+-----------------+
| 'l' 'i' 'b' 'X' |
| '1' '1' '.' 's' |
| 'o'  0   0   0  |
+-----------------+
|        0        |
+-----------------+
| 9               | Total hunk table size.
+-----------------+
| 6               | Clock's first segment starts at 6, ...
+-----------------+
| 8               | and ends at segment 8.
+-----------------+
| seg 6's size    | Clock's code segment.
+-----------------+
| seg 7's size    | Clock's data segment.
+-----------------+
| seg 8's size    | Clock's BSS segment.
+-----------------+

After processing the header, the loader now has a complete description of a program image in memory. It just doesn’t have any of the specific details hammered out yet. For that, we need to load our segments with data.

Hunks for Code and Data

HUNK_CODE hunks are for providing what would be considered .text or .rodata content in other formats. This is where your executable … well, code goes. Similarly, HUNK_DATA hunks provide statically allocated/initialized data that would fall in a .data section in other formats.

+------------------+
| HUNK_CODE        |
+------------------+
| # words in image |
+------------------+
|                  |
| code             |
|                  |
+------------------+

+------------------+
| HUNK_DATA        |
+------------------+
| # words in image |
+------------------+
|                  |
| data             |
|                  |
+------------------+

A BSS segment is specified in a similar manner; however, since its content is implicitly undefined, we only need to specify how big it is.2

+--------------------+
| HUNK_BSS           |
+--------------------+
| # words in segment |
+--------------------+

Note that, just like ELF but unlike a.out, you can have any number of code, data, or BSS segments. This was useful for earlier versions of AmigaOS, which did not automatically defragment free chunks of memory. As a result, there was an incentive to have many smaller code, data, or BSS segments as a means of working around the memory fragmentation that would inevitably build up like a layer of mold. You had to be careful though; too many small segments would actually exacerbate the problem, especially after you ran and quit an executable linked in this manner several times on a 256KB or 512KB Amiga! (Thankfully, I believe this issue was fixed with AmigaOS 2.0 or 3.0, so the incentive to coalesce like-typed sections became much stronger.)

Anyway, back to the hot hunks.

The very first code, data, or BSS hunk that the loader encounters in the file will obviously be used to load the very first hunk that is defined to belong to this file. Recall in the header that this was specified by the First hunk field. Each subsequent code, data, or BSS hunk fills the next slot in the hunk table. Any hunks that would exceed the table are ignored.

Relocations

Relocation hunks come in many varieties. I’m only going to illustrate one: HUNK_RELOC32.3 As with the GEMDOS file format, relocations are to 32-bit absolute addresses only. By the time you’re loading something into memory, all PC-relative or base address-relative offsets should have been resolved by the linker first. (There is also a HUNK_PPC_RELOC26 for PowerPC branch relocations.)

The HUNK_RELOC32 hunk looks like this on disk:

+----------------+
| HUNK_RELOC32   |
+----------------+
| N1             |
+----------------+
| Segment 1      |
+----------------+
|                |
| N1 offsets     |
|                |
+----------------+
| N2             |
+----------------+
| Segment 2      |
+----------------+
|                |
| N2 offsets     |
|                |
+----------------+
:                :
+----------------+
| 0              |
+----------------+

Basically, this is saying:

  • Apply N1 relocations to the most recently loaded segment by adding the base address of segment 1 to the 32-bit values at all these offsets,
  • Apply N2 relocations to the most recently loaded segment by adding the base address of segment 2 to the 32-bit values at all these offsets,
  • and so on.

This happens until we reach the end of the list of relocations, identified by the final 0 value.

Remember that relocations always apply to the most recently loaded segment. This is why relocations for a segment always follow the segment to which they apply. You can have any number of different types of relocations following a code or data hunk.4

Also note that it’s perfectly OK to refer to a segment which hasn’t been loaded yet. Recall that while processing the HUNK_HEADER, the loader knows how big each segment will be and can thus pre-allocate them. Obviously, you can also refer to segments which belong to shared libraries.
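
Putting the pieces together, here is a sketch of applying one HUNK_RELOC32 hunk in C. read_long is a stand-in for fetching the next big-endian longword from the file; the segment table is the one built while processing HUNK_HEADER, and a 32-bit address space is assumed:

#include <stdint.h>
#include <stdio.h>

extern uint32_t read_long(FILE *f);   /* stand-in: next big-endian longword */

struct segment { void *base; uint32_t words; };

/* Each group gives a count, a target segment index, and that many
   offsets into `cur`, the most recently loaded segment's memory.
   A count of 0 ends the hunk. */
static void apply_reloc32(uint8_t *cur, struct segment *tab, FILE *f)
{
    for (;;) {
        uint32_t n = read_long(f);
        if (n == 0)
            break;
        uint32_t seg  = read_long(f);
        uint32_t base = (uint32_t)(uintptr_t)tab[seg].base;

        while (n--) {
            uint32_t off = read_long(f);
            uint32_t *p = (uint32_t *)(cur + off);
            *p += base;           /* add target segment's base address */
        }
    }
}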

Tripos Shared Libraries

We’ve seen that Hunk Format executables share just about every major feature with ELF out of the box. The format is fully capable of supporting ASLR, nothing prevents its use in either single or multiple address spaces, and it supports dynamic linking. The only thing it doesn’t support is statically linked executables pre-designated for a specific load address.

Just not the way AT&T specified with ELF.

See, there’s a frailty with how Tripos handles shared objects: the executable needs to know at link-time where the library’s assets are located (which segment and the offset within that segment). On its own, this places a hard dependency on a specific version of a library. Change the library to a newer version, and those assumptions could well break.

For this reason, AmigaOS does not support Tripos-style shared libraries. Instead, it uses its own library system, provided by exec.library, which addresses this exact issue by using well-defined offsets into a jump table, itself relative to a kind of handle to the library. This lets the specific addresses of various assets float within their respective binaries, with only the relative offsets being well-known at compile-time.

Exec’s jump table is exactly like ELF’s PLT (Procedure Linkage Table) segment, only instead of your program constructing it, it’s part of the library you’re trying to call. Any publicly available data is similarly made available relative to this library base pointer. This has several advantages over ELF:

  • No need to keep symbol tables around, so it uses less memory.
  • Loading is much faster without having to scan for symbols all the time.
  • It’s faster than the ELF-style PLT. You’re jumping to an unconditional jump instruction, which modern CPU pipelines handle quite elegantly.

Of course, it comes at a cost: you can’t just use dlsym() to resolve a symbol on a library.
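
To see why it’s fast, here is roughly what a call through such a table looks like from C. Each slot at a fixed negative offset from the library base holds a small JMP stub, so the caller transfers control straight into it; the offset and names below are hypothetical, not real exec.library values:

typedef struct Library Library;   /* opaque: layout belongs to the library */

#define LVO_DOTHING  (-30)        /* hypothetical well-known slot offset */

/* The slot itself is executable code (a JMP stub), so its address can
   be treated directly as the function's entry point. */
static long call_do_thing(Library *base, long arg)
{
    long (*fn)(long) = (long (*)(long))((char *)base + LVO_DOTHING);
    return fn(arg);
}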

How to Make Tripos Shared Libraries Work Like ELF’s

You can definitely get some great mileage out of Tripos’ approach to shared libraries, despite the aforementioned limitations. Tripos, circa 1981, was successfully using them to make command-line tools as small as 512 bytes, while comparable tools in Linux, even with ELF’s shared objects, weigh in closer to 20KB. Even accounting for binary size differences between 16-bit and 32-bit platforms, we’re still looking at an order of magnitude smaller executables for Tripos than we are seeing for Linux. Arguably, Tripos wins at shared libraries.

But, OK, we want the ability to use C++, and we want unique data segments per process. How do we get these?

Relieve the linker of detailed knowledge of dependencies. This is not a prerequisite for normal, day-to-day use of Tripos shared libraries. But, it is definitely a requirement if you want symbols defined in the executable to be usable by those libraries. There are two ways of resolving this issue. Either introduce a replacement hunk type for HUNK_HEADER that implies the new semantics, or specify the use of negative hunk indices as a flag to the loader. For instance, you can declare that bit 31 of a hunk index, when set, marks a “module-relative” hunk index or some such. That way, the loader can reference library hunks when that bit is clear, and module-relative hunks when set. This would apply to all hunk indices, including those inside relocations. The probability of exceeding two billion hunks is asymptotically zero on 32-bit systems, and particularly so on 64-bit systems, so this is a pretty safe extension.
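
A sketch of index resolution under the bit-31 scheme just described; the function name is mine:

#include <stdint.h>

/* A set high bit marks a module-relative hunk index, which the loader
   rebases against this module's First hunk slot; a clear bit means a
   global (library) index, as today. */
static uint32_t resolve_hunk_index(uint32_t idx, uint32_t first_hunk)
{
    if (idx & 0x80000000u)
        return first_hunk + (idx & 0x7FFFFFFFu);   /* module-relative */
    return idx;                                    /* global index */
}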

Retain symbol imports and exports from the linker stage. This is a requirement that’s been discussed above under the GEMDOS section; it’s required if you want to allow libraries to bind to symbols you create in your executable.5 The loader of a module would need, then, not only to maintain hunk tables, but also symbol tables. This would enable, in turn, the next item.

Allocate new data segments on each successful load. Since libraries and executables are statically linked in Tripos, code hunks can freely reference data hunks, thus eliminating completely the need for GOTs. This implies that Tripos libraries (as with Exec-style libraries) maintain a shared body of data across all clients of the library. ELF mandates this cannot happen. By definition, each new program is in its own address space, so each library shares only its code; and at that, this code can appear anywhere in the process’ address space. All data hunks must be remapped for each address space, uniquely. To enable this to happen, the loader must recreate data and BSS hunks for each client use-case.

Page aligned code and data segments. This is not a hard requirement; don’t let anyone tell you otherwise. Even given an executable with non-aligned segments, you can still mmap them into memory. You need only give special consideration to the leading and trailing pages, which may have to be loaded by copying from the file if their boundaries aren’t page aligned. Given a 4KB page size, this means that you end up transferring no more than 8KB of content. For larger binaries (which are everywhere on my Linux box, by the way), the overhead of this would be unmeasurably small, as the overwhelming bulk of binary content would be memory-mapped as normal. However, if this is such an important feature to have, then one could introduce two new hunk types:

  • HUNK_NULL, which is a null hunk. It consists only of the HUNK_NULL ID. No size or payload field follows.
  • HUNK_SKIP, which contains a size field. Its sole purpose is to cause the loader to skip ahead in the file. The following hunk, presumably a code or data hunk, can then be placed such that its content is page aligned, and so can be mmapped per normal Unix conventions.

The HUNK_SKIP hunk allows for coarse-grained alignment control within the executable, while HUNK_NULL provides alignment control at individual word granularity.
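
For instance, a linker might size a HUNK_SKIP like this sketch, assuming 4KiB pages and that both the skip hunk and the following hunk carry ID and size longwords ahead of their payloads:

#include <stdint.h>

#define PAGE 4096u

/* Number of pad words a HUNK_SKIP needs so that the *payload* of the
   following hunk lands on a page boundary.  `pos` is the file offset
   where the HUNK_SKIP's ID word will be written; the 16 covers the
   skip hunk's ID+size words plus the next hunk's ID+size words. */
static uint32_t skip_words(uint32_t pos)
{
    uint32_t payload = pos + 16;                  /* if the pad were empty */
    uint32_t pad = (PAGE - (payload % PAGE)) % PAGE;
    return pad / 4;                               /* words, not bytes */
}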

Support for initializers and finalizers. This can be supported fairly easily using two new hunk types: HUNK_INIT and HUNK_FINI. As with my GEMDOS modifications above, these would contain vectors of initialization functions to call. Unlike GEMDOS, however, these would not occupy any code or data space. Rather, the loader would read in a set of hunk and offset pairs, intended to point at initialization and/or finalization functions, and call them sequentially in the order provided.

+-------------------+    +-------------------+
| HUNK_INIT         |    | HUNK_FINI         |
+-------------------+    +-------------------+
| N                 |    | N                 |
+-------------------+    +-------------------+
| Code Segment 0    |    | Code Segment 0    |
+-------------------+    +-------------------+
| Code Offset 0     |    | Code Offset 0     |
+-------------------+    +-------------------+
| Code Segment 1    |    | Code Segment 1    |
+-------------------+    +-------------------+
| Code Offset 1     |    | Code Offset 1     |
+-------------------+    +-------------------+
:                   :    :                   :
+-------------------+    +-------------------+
| Code Segment N-1  |    | Code Segment N-1  |
+-------------------+    +-------------------+
| Code Offset N-1   |    | Code Offset N-1   |
+-------------------+    +-------------------+

So far as I am aware, this is all that is required for Hunk Format to attain parity with ELF. Arguably, it is simpler to get Hunk Format working like ELF than it is for GEMDOS, particularly since the whole arrangement is built to be extensible via an event-driven loader.

As with the GEMDOS loader, these new semantics can be selected by altering which kind of header hunk appears at the head of the file. If you want the older-style, Tripos semantics, use HUNK_HEADER. For the newer, ELF-like semantics, use (just inventing a name here) HUNK_HEADER_MAS (for multiple address space support).

Though, honestly, considering Tripos libraries’ overwhelmingly superior memory savings over ELF-style libraries, I’m not sure why you’d ever want to.

Conclusion

I’ve shown how one can build a dynamic loader for the a.out format. It involves adding a few missing fields and pseudo-segments to the file, but it is not impossible, and it’s patently much simpler than ELF’s current file structure.

I’ve shown how Hunk Format executables already implement the vast majority of ELF’s more salient features. It should be noted that Hunk Format predates ELF by roughly a decade. It is possible to retrofit ELF-like behavioral patterns onto Hunk Format, but it’s not clear to me that this delivers a technical win. The memory savings one gets with a Tripos-style shared library is rather significant, even after accounting for instruction set density differences between processors contemporary in 1978, 1988, and 1998.

The one thing I would consider unconditionally retrofitting onto Hunk Format loaders, though, is the use of negative indices to the hunk table to refer to module-local hunks. It will make life easier for anyone looking to independently upgrade libraries from their clients. Knowing that your module’s hunks start at offset 6 in the hunk table (e.g., as in our clock example above) seems klunky to me, and to an extent breaches the encapsulation the loader’s intended to provide. I’ll need to research this further some day.

As well, I would continue to provide mezzanine documentation on conventions to help ameliorate versioning-related incompatibilities. This documentation would include, for instance, how to discover where a library’s jump table resides, etc. However, these things are complementary to Tripos-style libraries, and do not replace them.

Personally, I think Unix System V Release 4 would have been a lot better off if they’d built on top of Hunk Format executables instead of relying on ELF. There’s definite memory-savings potential from not having everything and their grandmother embedding a symbol table; the lack of a GOT (or at least a smaller one) means programs run faster, since less indirection happens all the time; and allowing a library to keep shared state across all its clients is a really useful thing to have, enabling libraries to offer first-class, system-wide services. While shared libraries of this kind would not serve as a complete replacement, they would reduce the pressure to use kernel modules, which are platform-specific and often have strange interface requirements, to serve these needs. This, in turn, would lead to a more modular and robust user-land experience without sacrificing the architectural benefits that a monolithic kernel can provide.

Anyway, there you have it: not just one, but two technologies which existed at the time ELF was invented, which with a little bit of thought could have easily subsumed all the use-cases of ELF, and which didn’t require 186 pages of documentation to describe poorly. It’s regrettable that neither of these technologies had the weight of AT&T backing them.6


1: For a quite compelling argument against this point of view, please read this collection of quotes against dynamic linking in general, and a uniquely ELF-related concept of versioned symbols.

2: It is interesting that we must give a hunk size for HUNK_BSS, even though nothing actually follows it. The authors of Tripos’ loader could have just as easily declared HUNK_BSS to just be a null hunk, but instead chose an extraneous size word to follow. Please note, however, that the BSS size specified in the HUNK_HEADER takes priority over the supplied size in the HUNK_BSS hunk.

3: Starting with AmigaOS 2.0, you can also use HUNK_RELOC32SHORT hunks to basically include the exact same information as above, but more compactly, using 16-bit words instead of 32-bit words. This is a space saving measure, since most segments for AmigaOS tend to be 64KiB or smaller.

4: It is technically illegal to place relocations after a BSS hunk. I’m not personally aware of any version of AmigaOS which will throw out such a binary, however. I suppose, if you’re crafty enough, you can use these relocations to pre-load pointers placed in BSS. I don’t recommend this, for obvious reasons; your program could well break on any revision of the operating system’s loader.

5: I think this is one of ELF’s worst features. A library should always be something that you depend upon; the library should never have to depend on you. When the latter happens, you really have a degenerate framework, and as we all know, frameworks are evil.

6: AT&T certainly did have control over the a.out header format for their Unix distributions prior to their switch to COFF with Unix System V Release 1. However, looking at the Wikipedia article on COFF, I’m utterly bamboozled as to why AT&T even bothered with it, when even a little bit of research in the matter would have shown hunk format to have both existed and to have met most of their needs at the time. As I’ve illustrated above, adding a few different hunk types would have added the remaining semantics that it was missing. For some reason, this never crossed anyone’s mind. Why? I can only think that, like so many other computer-related companies at the time, AT&T Bell Labs suffered dearly from Not Invented Here Syndrome.

author

Samuel A. Falvo II
Twitter: @SamuelAFalvoII
Google+: +Samuel A. Falvo II

About the Author

Software engineer by day. Amateur computer engineer by night. Founded the Kestrel Computer Project as a proof-of-concept back in 2007, with the Kestrel-1 computer built around the 65816 CPU. Since then, he's evolved the design to use a simple stack-architecture CPU with the Kestrel-2, and is now in the process of refining the design once more with a 64-bit RISC-V compatible engine in the Kestrel-3.

Samuel is or was:

  • a Forth, Oberon, J, and Go enthusiast.
  • an amateur radio operator (KC5TJA/6).
  • an amateur photographer.
  • an intermittent amateur astronomer, astrophotographer.
  • a student of two martial arts (don't worry; he's still rather poor at them, so you're still safe around him. Or not, depending on your point of view).
  • a former semiconductor verification technician for the HIPP-II and HIPP-III line of Hifn, Inc. line-speed compression and encryption VLSI chips.
  • the co-founder of Armored Internet, a small yet well-respected Internet Service Provider in Carlsbad, CA that, sadly, had to close its doors after three years.
  • the author of GCOM, an open-source, Microsoft COM-compatible component runtime environment. I also made a proprietary fork named Andromeda for Amiga, Inc.'s AmigaDE software stack. It eventually influenced AmigaOS 4.0's bizarre "interface" concept for exec libraries. (Please accept my apologies for this architectural blemish; I warned them not to use it in AmigaOS, but they didn't listen.)
  • the former maintainer and contributor to Gophercloud.
  • a contributor to Mimic.

Samuel seeks inspirations in many things, but is particularly moved by those things which moved or enabled him as a child. These include all things Commodore, Amiga, Atari, and all those old Radio-Electronics magazines he used to read as a kid.

Today, he lives in the San Francisco Bay Area with his beautiful wife, Steph, and four cats; 13, 6.5, Tabitha, and Panther.