home                                                                chris256.com
________________________________________________________________________________

                                                                   20 March 2025

The Cost of Abstraction or: How to Write Modular Code in Assembly
-----------------------

INTRODUCTION

Typical advice for writing "clean" code often emphasises creating more
abstractions. Consider the following examples:

  * "Decompose functions until they each do one thing"
    means creating more functional/procedural abstractions.

  * "Pass state explicitly rather than using global variables"
    requires grouping state into abstract 'Context' datastructures.

  * "Encapsulate your datatypes"
    means designing abstract interfaces to interact with your datatype.

I acknowledge abstraction is a powerful tool, but I argue that it has
associated costs which are too often overlooked by developers trying to
write clean code by the conventional advice. In this article I shall make
explicit four of these costs and discuss some illustrative examples for each.
By designing abstractions with these potential costs in mind, I believe that
their effects can be mitigated. And although it is a very sweeping claim
to make, I believe that most software invents too many abstractions. I think
that if developers were more aware of the downsides of creating new
abstractions then there would be fewer of them.


1. EXPRESSIVENESS

An abstraction can be thought of as a binary relation from a set of objects
in the abstract space to a set of objects over which we are abstracting (the
base space). If the relation is not surjective, then expressiveness is lost
by incompleteness, since not every base object has an abstract
representation. If the relation does not have a retraction when we consider
its domain to be limited to the image, then expressiveness is lost by
ambiguity, because an exact representation in the base space can't be
uniquely identified by the abstract space.

To make this clearer, let's use the C programming language as an example.
Consider a binary relation from the set of all valid C compilation unit
source texts to the set of all valid object files. There is an arrow iff the
source text could compile into the object file according to the C standard.
There exist object files for which there is no corresponding C program [1],
so expressiveness is lost by incompleteness. And object files can't be
uniquely expressed with a C source program [2], so expressiveness is lost by
ambiguity. There are some aspects of an object file which the C programmer
has no control over. When you are designing an abstraction, be aware of what
properties of the base object you have lost to ambiguity or incompleteness.
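
The two kinds of loss can be checked mechanically on a toy relation. The
sketch below is only an illustration: the file names, the COMPILES_TO set and
both helper functions are hypothetical, and a handful of hand-picked pairs
stands in for the real (infinite) relation between C sources and object files.

```python
def unrepresented(relation, base_objects):
    """Base objects that no abstract object maps to (incompleteness)."""
    image = {base for _abstract, base in relation}
    return set(base_objects) - image


def ambiguous(relation):
    """Abstract objects related to more than one base object (ambiguity)."""
    images = {}
    for abstract, base in relation:
        images.setdefault(abstract, set()).add(base)
    return {abstract for abstract, bases in images.items() if len(bases) > 1}


# Hypothetical toy data: (C source, object file it could compile into).
COMPILES_TO = {
    ("main.c", "main.o"),
    ("main.c", "main_with_nops.o"),  # same source, extra no-ops inserted
    ("util.c", "util.o"),
}
OBJECT_FILES = {"main.o", "main_with_nops.o", "util.o", "has_int3.o"}

print(unrepresented(COMPILES_TO, OBJECT_FILES))  # {'has_int3.o'}
print(ambiguous(COMPILES_TO))                    # {'main.c'}
```

Here has_int3.o plays the role of the object file from note [1], and main.c
the role of the source from note [2].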


2. OBSCURITY

Paper atlas maps baffle me because I have not learnt the meaning of the
abstract symbols representing landmarks. If there were less abstraction
(satellite imaging perhaps) I would be less confused, since I do understand
the underlying concepts which are being abstracted (churches, rivers, etc).
Obviously any abstract system requires its users to spend some time
understanding it before it becomes useful, but when it is built over an
underlying system that is already well understood then you should consider
whether or not they would be better off using that underlying system
directly.

To demonstrate obscurity I dug out a particularly offensive shell script I
wrote a while ago (below). Its purpose is to generate an aligned HTML table
listing of the current directory. It manages to be very compact thanks to
the liberal use of abstractions provided by bash and the Unix commands. But
it is clearly very difficult to read (or write) because no sane person has
memorised the format specifiers used by find, the semantics of sort's '-k'
option, and the fact that iec stands for "International Electrotechnical
Commission" in this context. I would have been better off writing a slightly
longer but more readable Python or Perl script.


    #!/bin/bash
    header="Type  Date        Size  Name"
    format="%y\t%TF\t%s\t%f\n"
    (find .. -maxdepth 0 -printf "$format" ;  find . -maxdepth 1 ! -name ".*" -printf "$format") \
      | sort -t$'\t' -k1,1 -k4,4  \
      | numfmt --delimiter=$'\t' --field=3 --to=iec \
      | awk -F'\t' '{$4="<a href=\"./" $4 "\">" $4 "</a>"; print}' OFS="\t" \
      | awk -F'\t' '{printf "%-5s %-11s %-5s %s\n", $1, $2, $3, $4}' \
      | sed "1i $header"
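
For comparison, here is a rough sketch of what the longer Python version might
look like. It is not a byte-for-byte port: the function names are my own
invention for this sketch, human_size only approximates numfmt's rounding, and
file names are not HTML-escaped (the shell script does not escape them either).

```python
import os
import stat
import time

HEADER = "Type  Date        Size  Name"


def type_letter(mode):
    """Single-letter file type, roughly like find's %y (d, l or f)."""
    if stat.S_ISDIR(mode):
        return "d"
    if stat.S_ISLNK(mode):
        return "l"
    return "f"


def human_size(size):
    """Size with a binary-prefix suffix, e.g. 2048 -> '2.0K'."""
    if size < 1024:
        return str(size)
    for unit in "KMGT":
        size /= 1024
        if size < 1024:
            return f"{size:.1f}{unit}"
    return f"{size / 1024:.1f}P"


def listing(directory="."):
    """Return the header plus one aligned line per visible entry."""
    # The parent directory first, then every non-hidden entry.
    names = [".."] + [n for n in os.listdir(directory) if not n.startswith(".")]
    rows = []
    for name in names:
        info = os.lstat(os.path.join(directory, name))
        date = time.strftime("%Y-%m-%d", time.localtime(info.st_mtime))
        rows.append((type_letter(info.st_mode), date,
                     human_size(info.st_size), name))
    # Sort by type first, then by name, like sort -k1,1 -k4,4.
    rows.sort(key=lambda row: (row[0], row[3]))
    lines = [HEADER]
    for kind, date, size, name in rows:
        link = f'<a href="./{name}">{name}</a>'
        lines.append(f"{kind:<5} {date:<11} {size:<5} {link}")
    return lines
```

Calling print("\n".join(listing())) prints the table for the current
directory. It is several times longer than the pipeline, but each step can be
read without consulting a man page.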


3. PERFORMANCE

The performance cost of abstraction is well understood in some contexts,
such as programming language design, but in other domains it is often
overlooked. Suppose we are writing a program which extracts a particular
field from a large JSON file and then exits. Most JSON libraries would
accomplish this by first parsing the entire JSON file and loading all its
fields into memory. Even if the library supported some kind of lazy parsing,
it would likely still unnecessarily take the time to cache the results of
the partial parse. The most efficient way to parse a JSON file depends on
the application's access patterns, and a library built for general use can't
anticipate the access patterns of all applications. Or put more generally:
an abstraction must make generalisations in order to be useful, but in
making these generalisations it can often lose information relevant to
optimising for performance.
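
As a small illustration of exploiting a known access pattern, the sketch below
decodes a single field without building the whole document tree. The function
name extract_field is hypothetical, and the shortcut only holds under strong
assumptions: the first textual occurrence of "key": is the field we want (it
does not appear earlier inside a string or a nested object), and the key
contains no JSON escape sequences. It also still holds the whole text in
memory, so it is a sketch of the idea rather than an optimal parser.

```python
import json
import re


def extract_field(text, key):
    """Decode only the value that follows the first '"key":' in text."""
    # Find the key and skip the colon and any whitespace around it.
    match = re.search(r'"%s"\s*:\s*' % re.escape(key), text)
    if match is None:
        raise KeyError(key)
    # raw_decode parses one JSON value starting at the given index and
    # stops there, ignoring everything after it.
    value, _end = json.JSONDecoder().raw_decode(text, match.end())
    return value


document = '{"padding": [1, 2, 3], "version": "1.2.3", "more": {"a": 1}}'
print(extract_field(document, "version"))  # 1.2.3
```

A general purpose library can't take this shortcut, because it can't assume
any of the things listed above about its input.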


4. BOILERPLATE



OBJECT ORIENTED PROGRAMMING

SUMMARY

NOTES

[1] On x86_64, the INT 3 instruction (which raises SIGTRAP on Unix-like
    systems) is used by debuggers to set breakpoints. The C standard provides
    no way to write this instruction, so any object file containing it can't
    be expressed in C, due to incompleteness.
[2] For example, you could always insert no-op instructions to get another
    object file corresponding to the same C source.


OOP is the exemplar.
Only make an abstraction when its benefit outweighs these costs.
In assembly the cost is so great that you will be going in circles if you stick
to the advice.
The costs of abstraction are:
  * expressiveness (markdown over html, glfw tablet input)
  * obscurity (Concise shell script or verbose python script, rust macros, OOP)
  * performance (High level languages, XML Parser, if two cstrings match then print their length, xlib)
  * boilerplate (OOP, Asm)
Building interfaces & abstractions is hard. A few good abstractions are better
than many shit ones.