29 Jan 2018
The Executable and Linkable Format, ELF, is too complex. Considerable effort is invested to specify the file structure, yet little effort is invested in specifying normative procedures when working with the file format. Regrettably, there currently is no solution to this problem; the standard has spread like mold on bread, and has permeated every facet of software development on non-Windows platforms. At best, you can find tools which complies with the standard and which implements the semantics closest to your requirements; at worst, you can migrate away from using ELF all-together, but this incurs real development costs.
I bet, when ELF first came to be known, people must have thought the world of it. Here was a file format which promised to unify loader and linker formats alike, it was, ostensibly, an open standard providing a compatible container for both executables and shared libraries alike, and which provided this service on a number of different operating systems and for different processor architectures. Since its introduction, ELF has been adapted for use on 8-bit, 16-bit, 32-bit, and more recently 64-bit CPUs. There’s no evidence to suggest that ELF won’t be used to describe 128-bit software when the time comes.
However, while the glorified glossy ads in popular computer rags were convincing to what seems like everyone else except me (indeed, you can still find pedestal pieces even as recently as 13 years ago), actually getting my feet wet with this file format proved utterly corrosive to my views on ELF technology.
I found that the tooling surrounding the standard must have started relatively simple, but has since evolved into a complex and incomprehensible quagmire of standards and tribal knowledge since then. This inexorably lead to the platform-specific dependencies and special cases we see today. As it happens, these have real-world costs. 1
This article is the first of two, a screed containing my informed opinions about the ELF standard. I must warn you; I’m not kind to ELF. The next article will attempt to compare ELF to binary formats that I personally have experience with and which were contemporaries of ELF when it was first introduced. In it, I’ll also discuss ways that those formats could have been evolved incrementally to support all the features we’ve come to expect and love from ELF. But, before we get to the good stuff, let’s set the stage explaining exactly why I hate ELF so much.
ELF was originally introduced with AT&T Unix System V Release 4. On the page linked previously, an introductory paragraph reads:
By design, ELF is flexible, extensible, and cross-platform, not bound to any given central processing unit (CPU) or instruction set architecture. This has allowed it to be adopted by many different operating systems on many different hardware platforms.
While I concede that ELF is flexible, I will argue later on that this is hardly unique to ELF. Is it extensible? Yes, but I’ll show how doing so is awkward. There are other formats which are much more cleanly extensible. Is it cross-platform? This depends on how you define the term. While the structure of the files are definitely cross-platform, interpretation of the file’s contents most certainly is not. This can be seen by the wide variety of ABI standards which the interested reader can find for various processor architectures, all of which fills in the gaps that are missing from the generic ELF standards. But, here again, I will argue this is hardly unique to ELF. Indeed, one of earliest cross-platform executable formats was the infamous a.out format. I’ll show an example of this in the next article as well.
If you’re like me, you probably haven’t worked at AT&T Bell Labs when ELF was first invented. This means you’ll need to rely on Google to find your specifications. So, I’ll happily admit here and now that my sources may not necessarily be the most authoritative.
That said, when I look for the ELF Specification, I’m lead to a document that consists of 60 pages. This is already fairly heavyweight, if you ask me; but, let’s give it the benefit of the doubt. Remember, this format aims to fulfill three distinctly different goals:
Lets ignore that specifications for other loader formats that predate it are much smaller; if we average 20 pages of specification per role, it seems ELF is a manageable, even if mildly inconvenient, format.
But, wait, it doesn’t stop here.
When ELF was invented, 64-bit systems were not popular. It needs an additional 18 page annex to properly cover 64-bit systems.2 This annex assumes familiarity with the preceding 60-page document, so now we’re at, rounding up, 80 pages to read.
But, wait! There’s more!
After reading a whopping 80 pages, you’ll quickly realize that you still don’t know how to properly load an ELF file. What is entailed with loading an ELF file? That depends on which one of the three kinds of binary artifacts it represents. If all you’re loading is raw binary data which requires no relocation at all, you can just read the file into an arbitrarily placed buffer and you’re done. This is, perhaps, where the idea that ELF was “simple” comes from. However, it’s a red herring. As soon as relocations are involved, you’re going to be wishing you’d never used ELF in the first place.
See, to learn how to properly load a binary, regardless of the kind, you now need to read through the complete specifications that binds the ELF standard to your specific microprocessor. In the specific case of AMD’s x86-64 platform, that document takes a whopping 108 pages. That’s 108 pages to tell you such things as:
Long story short,
to get an inkling of an idea of how to properly apply ELF to your loading needs,
you need to basically read through 186 pages of documentation.
The sad part is, even after all this documentation,
you still won’t know definitively how to actually go about applying this knowledge.
This is where the tribal knowledge comes into play.
Want to build an ELF loader? Better read the source to
Oh my goodness, you can’t make sense of it?
Tsk tsk, you don’t have any business building a loader;
you’re clearly unskilled in the art.
As I indicated earlier, static executables are almost too simple.
A loader basically performs
mmap to memory-map the file into a process’ address space,
then performs the ABI’s requirements for process bring-up, and dispatches accordingly.
This is because the linker,
whose name is
ld in Unix,
is responsible for not only linking the final executable,
but also pre-relocating the image into its final resting place in the process’ address space.
In short, the loader (hence the linker’s name!) has already done its work,
and is just waiting for the kernel to basically load the image raw into a process
and dispatch to it to get things moving.
So, if that’s the case, why do we need a special executable format at all?
.COM files were more flexible than Unix statically-linked executables.
I’m not even joking; COM-files could at least appear anywhere in the real-mode address space,
modulo 16-byte alignment requirements.
(I should point out that RISC-V raw binary images share this characteristic;
you can get very far indeed with nothing but raw binary images with this processor ISA.
Better yet, you don’t even have segmentation-imposed alignment restrictions like you do in
Due to the limitations of the processors Unix historically ran upon, combined with the needs for inter-process isolation on timesharing systems, Unix executables have a hard requirement on the existence of a page-capable memory management unit. Statically linked executables are expressly incompatible with segment-based memory protection; I mean, you could conceivably make it work depending on how segments are implemented on your platform, but even assuming something so segment-friendly as the GE-645 or Intel x86 architectures, you’ll still suffer a serious performance regression the moment swapping becomes necessary, as you must swap the entire code, data, BSS, and/or stack segments out in their entirety. Hope your programs aren’t much larger than 4KiB, and hope even moreso that they don’t deal with a working-set larger than about 64KiB.
Even in the Unix environment, where you’d expect pre-loading a binary artifact to a specific address seems to be the right thing to do, you’ll find problems of a different sort: security. There exist a whole class of attacks on software which can be prevented only through address space layout randomization (ASLR). Unix-style static executables are potentially prone to these kinds of attacks.
Finally, statically-linked ELF executables cannot make use of dynamically-linked anything, ELF or otherwise, without expending a great deal of effort to do so. (You basically need to embed an entire ELF loader in your application.)
So, one must wonder why it’s at all desirable to express such artifacts in an ELF container. It just doesn’t make sense: you’re adding unnecessary headers that serve no real purpose. You might as well just load from an a.out file. ELF is clearly a waste here.
This section focuses just on shared objects. These are files which, in turn, falls into two sub-categories:
I’ll get to PIE-specific stuff in the next section. For now, what appears in this section applies to both libraries and PIEs.
One good thing about shared objects is that they can be located anywhere in the process’ address space. Inasmuch, they’re probabilistically immune to attacks that depend on a known link address. Thus, ASLR is an effective counter-measure to these kinds of attacks. Yay!
However, supporting these kinds of artifacts quickly proves an exercise in dementia.
The data structures describing the shared object forms a graph,
and not a pleasant one to traverse, either.
This means, if you want an acceptably performant loader,
you cannot just read-and-seek through the file,
for you’ll be doing this a whole heck of a lot!
Most loaders instead use
mmap to memory map the whole ELF file,
then treat the whole thing as one giant network of C structs.
I’m not even kidding; google for ELF loaders if you don’t believe me.
This means that, during the loading phase, you’re potentially using twice the minimum memory required
(at least during the load process).
For desktop systems, not such a big deal.
For cloud or timeshared systems, this can seriously impact your neighbors, especially if you’re loading big binaries.
(Here’s looking at you, Java virtual machine, compiled HipHop resources, et. al.!)
Hope your binary isn’t too big!
The first oddity is that there exists two ways of looking at a shared object:
These may or may not be mutually exclusive; I just don’t know. Nothing I’ve read about ELFs so far says these have to be mutually exclusive. This is probably why every shared object file I’ve seen to date has both program headers and section headers.
Program headers are probably one of the few things I actually like about ELF. They describe the large-scale structure of the contents of the ELF file, like what their preferred virtual address is3, what memory protections the segment has overall (e.g., read-only for constants, executable for code, and read-write for data), and so forth.
The problem I’ve found with PHDRs (as they’re called more conveniently) is that you must iterate through them to find the segments you actually want to load. Not all PHDRs describe segments for code or data; some describe meta-data.
Near as I can tell, the information contained in such a “segment”
duplicates all the information that you could access
(sometimes more conveniently, sometimes not so conveniently)
in the file’s section headers, or SHDRs.
Let me make this clear:
in not so many words,
a loadable ELF file has two ways of accessing every piece of information you need to load it.
PT_DYNAMIC segments appear to be the preferred/authoritative way to load information in a file,
it seems to me that having section headers also is a waste.
Once again, we find ELF to be wasteful of its space at worst, and unnecessarily confusing at best.
It doesn’t stop there, though.
The layout of
PT_DYNAMIC segments is that of a “tag list”.
(If you’re a software developer familiar with AmigaOS 2.0 or later,
you’re already intimate familiar with tag lists.)
This is markedly different from how sections are managed,
wherein they have an array of higher-level data structures to describe them.
To compensate for their limitations,
loaders interpreting a
PT_DYNAMIC tag list
often must treat several tags as a tiny structure unto themselves.
DT_JMPREL specifies the start of a list of relocations.
You don’t know unless you consult both the ELF file header
(to determine 32/64-bit width)
DT_PLTREL tag, to find out if the relocations need addends or not.
Once you have this information, you now need to know how many relocations to apply.
So, you scan the tag list for a
This tells you the byte size of the list of relocations;
not the count of relocations.
you have to write a division routine
on any platform you want to use ELF on
and which doesn’t have built-in division instructions.4
You can’t put such code off in a disk-resident file; it must take up valuable ROM space.
Alternatively, you write code that always assumes you have addends,
and performs pointer-arithmetic hocus-pocus to compensate if you don’t need addends after all.
you’re writing horribly ugly,
and definitely unnecessary code given other loader formats that exist now, or even back when ELF came out.
Oh, with related tags like these, either they’re all present, or all not.
If any one of those tags are missing?
the entire ELF is malformed,
and unless you’re willing to traverse section headers,
you just don’t know how big the various
.rela.* sections are,
and to what other sections they apply to.
Every ELF loader I’ve seen just throws the ELF out as corrupt
instead of trying to be intelligent about section headers.
Doubly sorry. Tell me again why section headers are part of a loadable file, if they’re just not going to be used?
One of the biggest beefs I have with ELF relates back to the lack of normative procedures concerning how to actually use it. There is a good reason why Fred Brooks is famous for having written a thing, which reads,
Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious.
Nowhere is this more demonstrable than with ELF vs. well,
just about any other executable format you care to name,
with the possible exception of
The ELF structure gives no hint at all about how to load it.
a.out, hunk format, et. al.
there is a formal structure which makes it clear what you can do with it.
In fact, Eric S. Raymond has written at length about the importance of simplicity in open source software, including this gem:
Smart data structures and dumb code works a lot better than the other way around.
You’ll see how this particular principle is applied in Amiga’s Hunk Format later on.
Needless to say, while researching how to write a dynamic ELF loader for my own project (Kestrel-3), I’ve all but given up. I just can’t figure it out on my own, despite being a C-programming veteran since 1985. When looking at pre-existing examples (e.g., on Github), one thing struck me: they’re all different. Since there’s no normative procedures concerned with how to load an ELF into memory, and because each ELF binding is unique to their OS and CPU architecture, everyone ends up reinventing their own wheel a slightly different shape.
Let that sink in for a minute: me, a person who advocates frequently that you should reinvent your own wheels, who is building his own computer entirely from scratch, is concerned that there are a little bit too many wheels to choose from when it comes to ELF loading.
This is a concern to me because I just know bugs will be rampant. Edge cases have already been exploited by crackers. And finally, which platform-specific implementation represents the closest set of semantics to what I’m looking for? Which is right for my needs?
I’m thoroughly happy that the Unix-y environments finally have a standard everyone can agree upon for shared objects. Don’t get me wrong! But does it have to be this complicated? I don’t believe it needs to be so. To this end, ELF is wasteful because it’s over-engineered and under-documented in exactly the ways that it so desperately needs to be.
Now we come to the issues that concern PIEs specifically.
ELF mandates that all PIEs have a special
whose sole purpose is to tell the loader which program will actually load the ELF file.
Let that sink in for a moment. I’ll wait.
One way (of so many different ways!) to load a PIE
mmap the ELF you want to load into a process address space,
mmap the interpreter specified,
and then invoke that interpreter.
Now, I get it –
the Unix System V Release 4 authors
just wanted to extend the
#!-convention to binaries.
I get it, and I think it’s a decent idea, in theory.
In practice, not so much.
There are so many different ways you could go about doing this,
and in fact, Linux itself uses just such a different approach that makes far more sense.
The kernel basically looks at the file header’s magic constants,
and based on those, iterates through a list of potential loaders until it finds one that it thinks will work.
The chicken-and-egg problem of loading a loader to load the loadable ELF file doesn’t happen.
How do we know it’s a chicken-and-egg problem?
Because according to the ELF ABI specifications,
you guessed it,
ld.so for your specific platform must itself be an ELF shared object file.
Not only that,
but get this, when it’s first mapped into your process’ address space,
it must relocate itself.
The kernel offers no assistance.
The interpreter obviously can’t, since there is no interpreter for the interpreter itself.
And, of course, the process image that’s really being loaded hasn’t even been touched yet at this point.
So how does the “interpreter” find out about itself, like where it sits in its own address space? It’s obviously much too much to ask the interpreter to include a small assembly thunk that discovers its own place in the address space,5 so obviously the kernel has to provide this information somehow via something called an “auxiliary vector.” This is yet another tag-list, but this time, it’s constructed dynamically by the kernel for the purposes of kicking off the interpreter. I should also point out that a similar auxiliary vector is constructed by the interpreter for consumption by the intended process being launched.
I’m not joking when I say that kicking off a process in Unix incurs massive overhead.
I’m willing to bet that all this auxiliary vector mangling is at least as expensive as any TLB flush overhead incurred
from forking of parent process state.
(Especially since it appears that
mmap to allocate space for these vectors in the first place.)
Think about it:
to get a C program running,
you already need an argument vector and a complete,
local yet potentially tweaked copy of the program’s environment vector.
Now we have to throw in an auxiliary vector too?
And kicking off a process incurs double the overhead, since you must also do this for the interpreter as well!
But, we have 3GHz CPUs and up, so who cares, right? It’s all good.
The overwhelming majority of Linux executables fall under this category of artifact. These are binary images which, despite being able to take advantage of shared objects of other kinds, nonetheless manage to combine the worst attributes of both static executables and shared objects, without compensating returns in any other way.
They are assigned precise load addresses in the virtual address space, and so are susceptible to exactly those kinds of attacks which ASLR is intended to prevent. They’re also impossible to use in single address spaces, and cannot be used in a segmented memory architecture of any kind. Yet, despite these similarities with statically linked executables, they nonetheless require all the complex infrastructural overhead that shared objects typically require. What’s not to love about them?
The kernel could just as easily have loaded your
a.out format binary,
which then calls upon
dlopen() to open a library via a kernel-resident system call.
This system call could then have dynamically upgraded the process’ address space to include
from which it could then include any dynamic library the main binary desired.
There simply isn’t any good reason why you need a special ELF format just for this other than pure convenience.
With no redeeming qualities at all,
it baffles me how this format became the most popular of all the ELF artifacts.
I mean, if you’re going to invest in the machinations to support shared objects,
why not just make everything PIEs, and leave the mere possibility of an address space attack in the bin?
Alas, such is not so: to this day, to build a PIE, you must do so explicitly by specifying
-fPIE to the compiler, and
-pie to the linker, and then pray that it works for you.
Because, if it’s one thing I’ve discovered in my investigations,
despite absolutely no compile-time or link-time or load-time errors,
PIEs are not guaranteed to not segfault on my computer.
Looks like we’re stuck with these decrepit artifacts for quite a while longer.
Among other things I’m too lazy to recount, I’ve enumerated reasons why I think ELF is bad technology.
I could maybe go on if I bothered or cared to dive deeper, but I think this list is a pretty damning indictment on ELF. In the next article, we’ll compare ELF to other contemporary alternatives, and I think you might be surprised to see how much simpler they are in fact, not just in theory, especially when considering how they can be incrementally improved upon in a backward compatible manner to encompass the salient features of ELF. You’ll see then that there’s no technical basis for ELF’s popularity at all. ELF’s popularity derives from one thing and one thing only: GNU’s adoption of it as a standard, given that AT&T promoted it as their solution to a problem Unix administrators and vendors were having at the time. Like so many things in computing that we must now suffer for, ELF was a mistake that became popular; because, after all, AT&T, the inventors of Unix, couldn’t possibly be wrong about it, could they?
1: I originally wanted to link to a single incident I had in mind. However, upon realizing just how many exploits have happened, I figured I’d show you what a single search query can yield. It’s dumbfounding. When faced with this evidence, one cannot deny the complex semantics behind ELF and its relationship with the surrounding operating system makes the continued reliance upon it a security liability.
2: I didn’t notice, or I’d simply forgotten until re-reading recently, that this standard started out as a platform-specific extension to ELF. This was originally used between Hewlett-Packard and Intel for their respective 64-bit CPUs.
3: Although ELF places an unhealthy amount of support for binaries that prefer to load at specified locations, for the first time we see that ELF allows segments to be loaded at arbitrary addresses. This is why you need to compile your executables as PIEs to support ASLR.
4: And, there are many such platforms, admittedly mostly in the embedded space. While your desktop computer and cloud server farm has evolved CPUs that provides a rich instruction set, they also consume voracious amounts of power doing so. Embedded devices often sacrifice rarely used features to allow for greater power savings. Division is one of those things that, compared to multiplication, is done only rarely.
5: Even though it’s dreadfully simple to do. On x86, you basically call, as a C procedure, something like this:
; If called from code at the very beginning of the text ; segment, this will deliver the base page of the text ; segment. From this, you should be able to determine ; where everything else sits. _find_calling_address: call 1f 1f: pop eax ; Get address of this POP. and eax, 0FFFFF000H ; Find what page we're in. ret
This code does horrible things to your pipeline’s performance;
but, it’s only called once so no big deal.
I/O overhead from
mmap-ing the process image will dwarf any overhead this incurs.
And, if you’re running on an x86-64 platform,
it’s even easier still, since you can use
RIP-relative addressing for most things.
Software engineer by day. Amateur computer engineer by night. Founded the Kestrel Computer Project as a proof-of-concept back in 2007, with the Kestrel-1 computer built around the 65816 CPU. Since then, he's evolved the design to use a simple stack-architecture CPU with the Kestrel-2, and is now in the process of refining the design once more with a 64-bit RISC-V compatible engine in the Kestrel-3.
Samuel is or was:
Samuel seeks inspirations in many things, but is particularly moved by those things which moved or enabled him as a child. These include all things Commodore, Amiga, Atari, and all those old Radio-Electronics magazines he used to read as a kid.
Today, he lives in the San Francisco Bay Area with his beautiful wife, Steph, and four cats; 13, 6.5, Tabitha, and Panther.