@macshonle@c.cim - Mastadon

21 March 2010

Beware Approximate Models

Joel Spolsky has closed his blog Joel of Software, but the discussion he started continues. One persistent article is on The Law of Leaky Abstractions. While there are other critiques of the claims in the article, none have quite settled on what I found unsettling about it. We must properly use abstraction and beware of both its advantages and limitations. In particular, it is dangerous to use an abstraction with an oversimplification of how it works.

What Abstraction Is
Abstraction
is a mechanism to help take what is common among a set of related program fragments, remove their differences, and enable programmers to work directly with a construct representing that abstract concept. This new construct virtually always has parameterizations: a means to customize the use of the construct to fit your specific needs. For example, a List class can abstract away the details of a linked-list implementation-- where instead of thinking in terms of manipulating 'next' and 'previous' pointers, you can think on the level of adding or removing values to a sequence. Abstraction is an essential tool for creating useful, rich, and sometimes complex features out of a much smaller set of more primitive concepts.

Abstraction is related to encapsulation and modularity, and these concepts are often misunderstood.

In the List example, encapsulation can be used to hide the implementation details of a linked-list; in an object-oriented language, for instance, you can make the 'next' and 'prev' pointers private, where only the List implementation is allowed access to these fields.

Encapsulation is not enough for abstraction, because it does not necessarily imply you have a new or different conception of the constructs. If all a List class did was give you 'getNext'/'setNext' style accessor methods, it would encapsulate from you from the implementation details (e.g., did you name the field 'prev' or 'previous'? what was its static type?), but it would have a very low degree of abstraction.

Modularity is concerned with information hiding: Stable properties are specified in an interface, and a module implements that interface, keeping all implementation details within the module. Modularity helps programmers cope with change, because other modules depend only on the stable interface.

Information hiding is aided by encapsulation (so that your code does not depend on unstable implementation details), but encapsulation is not necessary for modularity. For example, you can implement a List structure in C, exposing the 'next' and 'prev' pointers to the world, but also provide an interface, containing initList(), addToList(), and removeFromList() functions. Provided that the rules of the interface are followed, you can make guarantees that certain properties will always hold, such as ensuring the data-structure is always in a valid state.

Although terms like abstract, modular, and encapsulated are used as positive design descriptions, it's important to realize that the presence of any of these qualities does not automatically give you good design:

  • If an n^3 algorithm is "nicely encapsulated" it will still perform worse than an improved n log n algorithm.
  • If an interface commits to a specific operating system, none of the benefits of a modular design will be realized when, say, a video game needs to be ported from Windows to the iPad.
  • If the abstraction created exposes too many inessential details, it will fail to create a new construct with its own operations: It will simply be another name for the same thing.

Joel's So-Called Law of Leaky Abstractions
Joel's law of leaky abstraction states: "All non-trivial abstractions, to some degree, are leaky."

We should first pause to examine the law itself-- like similar "laws," it is not falsifiable. Based on how you define what is trivial or not, or what is leaky or not, any claim of leakability can be made about any abstraction.

So, we should look at what Joel identifies as leaks. Joel's leaks fit the following patterns:

  • Accidental Limitations - E.g., there is no perfect String class in C++, because the language lacks certain abstraction features. Poorly designed APIs are also limitations that are purely accidental.
  • Layering Limitations - E.g., TCP is layered over IP, and so it shares the fundamental limitations of IP, even though it provides new functionality built on top of it. NFS is an example of a layering that creates a new interface to a service, but cannot provide the same availability characteristics expected of other file systems.
  • Performance Limitations - E.g., SQL queries are supposed to be fully declarative (as if writing in relational algebra directly), but when performance matters, you must have intimate knowledge of the database's execution.

Accidental Limitations are a problem when it comes to using real code, particularly when that code is part of an API you need to use. Claiming the entire concept of abstraction is "leaky" because people can write poor code isn't quite right: We could try to claim the same thing about any other useful technique or tool that may be misapplied.

As for Layering Limitations, these aren't really "leaks": We cannot expect abstraction to do what is physically or computationally impossible (wires in a network can get cut, and hence your music over TCP/IP won't play).

That leaves Performance Limitations, and is the strongest argument Joel makes. When the term separation of concerns was coined, one concern described was efficiency. To get the best performance, we need to understand things like memory hierarchies and database engines. You can reason about efficiency separately from other concerns: For example, one day I can look at my database queries and reason about their correctness; and on another day, I can look at the same queries and reason about their efficiency. The query abstractions help me reason about correctness-- and even efficiency, because many optimization cases I would otherwise need to worry about are handled for me.

Why can't the efficiency concern be properly isolated? Because efficiency is inherently a crosscutting concern. Any single weak-link in a chain of unbelievably fast code will drag the entire performance of the system down with it. Research in aspect-oriented programming, which is focused on isolating crosscutting concerns, has had limited or only domain-specific successes in isolating efficiency concerns. It is a hard problem, the philosophy of which would lead us to talk of artificial intelligence and computational complexity.

So, in examining Joel's argument, we find he wants us to believe the problems of efficiency are related to the problems of poor written APIs and also the problems of fundamental limitations. No wonder he is left to conclude that abstraction is "dragging us down." But we don't have to be so gloomy.

Leaks that Aren't
Because abstraction cannot do the impossible, and because isolating efficiency concerns is a long shot possibility, we have to be careful when using abstractions. In particular, we must be sure with the abstractions we use that we are modeling them correctly.

Donald Knuth has said that a computer scientist must have "the ability to work with multiple levels of abstraction simultaneously," from high-level to low-level.1 We cannot work with simpler models in our minds just because we are working with something that looks simpler.

Let's look at garbage collection as an example. Today, unless a language is meant for writing operating systems, virtual machines, embedded systems, or any other exceptions, it will have garbage collection. The basic idea is that you can allocate memory, but do not have to worry about deallocating it. However, that is the basic idea, and not the true abstraction! In reality, we don't have infinite memory. Here is the actual idea behind garbage collection:

Allocated memory that is no longer reachable from any of the program's threads can be safely deallocated by the garbage collector.

This idea is radically different from the infinite memory idea. For instance, we now have to know about the concept of reachability. How does this change the way we use garbage collection? Under the first, approximate model of GC, we will be lead to memory leaks under certain circumstances. In the second, correct model of GC, we will prevent memory leaks because now we know about WeakReferences.2

By thinking that the GC "is magic," we will be lead down the wrong path. For example, there are times when we know that an object "will no longer be used," but that is not the same as "is no longer reachable." Thus, if we have the right conception of the GC in mind, we find ourselves using WeakHashMaps more frequently.3 My rough estimate is that 20% of the HashMaps I use need to be WeakHashMaps; without the proper conception of GC, then a significant portion of my code would be wrong!4

Abstraction is about Design
Abstraction is not primarily about overcoming fundamental limitations of the machine or its environment. Abstraction is about organizing your code to match your conceptions of your design.

Although abstraction provides many benefits that simplify the task of programming, these simplifications are not strictly necessary. We can write our programs without so many abstractions. For example, whenever we need a Node structure that holds a value and pointers to two other Nodes, we can use a simple Node class, and use it for both a doubly linked-list and a binary tree. If we programmed like this, we could even save a few lines of code: The routine to get the right-most node of a binary tree could also be used to iterate through a doubly linked-list.

We should stop thinking of abstractions necessarily as higher or lower level than each other. Abstractions are relative to your design, and what looks like "the same" abstraction in one case can be too high level (leaving out too much), too low level (over specified and lacking flexibility), or just right.

Instead, think about abstractions as providing support for your design conceptions. If, in the process of creating your abstractions, you are lead to "leaks" then the problem is in your design.

***
1. Further quoted: "When you're working at one level, you try and ignore the details of what's happening at the lower levels. But when you're debugging a computer program and you get some mysterious error message, it could be a failure in any of the levels below you, so you can't afford to be too compartmentalized."

2. And SoftReferences, for when performance concerns are high.

3. Also see Google's MapMaker factory: new MapMaker().weakKeys().makeMap().

4. When thinking about memory, we also need to remember that Accidental Limitations do come up. For example, when using the substring method, you may need to wrap it around a call to new String().

1 comment: