The FSet library is a functional implementation of set-theoretic collection datatypes. By "functional" I mean that all update operations return a new collection rather than modifying an existing one in place.
The ultimate goal of FSet is an ambitious one: to be the default choice for collection types in all the languages it supports. By "default", I mean the one you can be in the habit of using most of the time, except when you know you need something more specialized. Any collection library has strengths and weaknesses; none is good at everything. FSet seeks, however, to have no fatal weaknesses that make it terrible for some uses; it tries to do everything pretty well, at the possible expense of being the best at any one thing. This means, in roughly decreasing order of priority:
To achieve these goals I have made the following implementation choices:
These choices are discussed in more detail on the project page. Here, I want to address a more fundamental question: am I really suggesting that the default collection library for ordinary programming should be functional? And if so, why?
Yes, I am absolutely saying that: functional collections are the best choice for general use in many languages like Lisp. Let's look at some reasons why.
The first reason I want to discuss is primarily stylistic in nature and may seem abstruse, but it has very real consequences. In a nutshell: collections are not objects; they are mathematical values. More precisely, in the object-oriented analysis of a domain, it is very rare that a collection will correspond exactly to a stateful entity in the resulting model. Far more commonly, the collection is an attribute of some larger entity that maintains invariants involving the collection as well as others of its attributes.
Here's an example. Consider the object-oriented implementation of a Web shopping cart. This might have, as attributes, a bag (multiset) of items, a current total price, and a current total shipping weight. Whenever an item is added to or removed from the bag, the price and shipping weight need to be updated. That is, adding and removing items are necessarily operations on the shopping cart object as a whole; they don't just operate on the bag. This is what I mean when I say that the bag doesn't correspond to a stateful entity in the model; the stateful entity is the whole cart, with the state of the bag being just a component of that state.
Note that in this example, the price and shipping weight are, almost certainly, going to be implemented as values (numbers). Why does it seem so inevitable that the collection of items would be itself a mutable object? I think the biggest answer is simply habit. Programming began with imperative languages; an appreciation of the benefits of functional programming has only come more recently. We simply haven't gotten around to implementing functional collections in a way that would make them widely useful. Obviously, that is the situation I am seeking to rectify with FSet.
Also, a lot of the time, collection values are single-threaded — referring here not to threads of control, but to cases where a newly computed collection value replaces a previous one, such that the previous value is no longer needed. In the shopping cart example, when we add an item to the bag, we no longer care about its previous contents; so the use of a mutable collection is not ruled out, and seems natural. But actually, collections aren't special in this way; lots of values are single-threaded. Again in the example, the current price and shipping weight are also single-threaded, yet it's unlikely we would create mutable price and weight objects and update them — what would be the point?
You see, it really is mostly just habit. Concepts that can be implemented
in a single machine word we have tended to implement as values, while those
that require allocation we have tended to implement as mutable objects; but
there's nothing necessary about this. Lisp's bignums, for example, are
allocated but have value semantics, so the programmer doesn't even have to
know when a computation on fixnum (single-word) operands has overflowed into
a bignum (allocated) result; and similarly, Java's BigInteger
,
which is allocated, also has value semantics.
And there are real, practical, everyday problems with mutable collections. The worst of these has to do with passing them across interfaces (where, note, single-threadedness cannot be guaranteed). Consider the interface that the shopping cart presents to its client. Obviously, it will be necessary at some point for the client to ask the shopping cart for its items (e.g., to display them on the Web page). What is this operation going to return? If it returns the mutable collection instance the cart is using internally, then the cart exposes itself to invariant breakage: the client could add an item to the collection directly, whereupon the price and shipping weight wouldn't be updated. The cart's only way to prevent that, however, is to copy the collection before returning it: an unpalatable linear-time operation.
In this simple example it may seem implausible that the cart's client would be so obtuse as to modify the returned collection, but programmers do make errors of this kind. Indeed, in more complex situations I have seen very talented and experienced senior developers with PhD's in computer science have a real problem with this. It gets to be very tempting, when you're in a hurry, to mentally conflate the collection itself with the object that owns it — to the point of even sometimes forgetting to create such an object — with the inevitable result that critical invariants get neglected.
A key goal for FSet is to provide completely automatic support for arbitrarily nested collections. Compound types like maps keyed by sets of sequences just work with no additional programmer effort. While it's not every day one needs such a type, it does happen, more often than one might think. (Well, maybe I'm biased by working in compilers and static analysis — such types arise in this field fairly frequently.) It's dangerous to do this kind of nesting with mutable collections, because (for instance) modifying a set that is in use as a map key will blow the map's ordering invariant.
[Could use a good example of a nested collection type]
As multiprocessor hardware becomes much more common, multithreaded
software is becoming much more important. This brings up a second very good
reason for expanding the use of functional collections: they can be
thread-safe without locking. This is a very big deal, because collections are
by nature general-purpose. One certainly wouldn't want to pay the cost of
synchronization on every access to a collection, since many collections are
used in only one thread; yet failure to synchronize a collection that really
is being updated in multiple threads is a recipe for disaster. I think it is
interesting to look at what the Java API documentation says about this in,
for instance, the page for java.util.TreeSet
:
Note that this implementation is not synchronized. If multiple threads access a set concurrently, and at least one of the threads modifies the set, it must be synchronized externally. This is typically accomplished by synchronizing on some object that naturally encapsulates the set.
Notice the agreement with my argument in the previous section: there is usually an object that naturally encapsulates any collection.
If no such object exists, the set should be "wrapped" using the
Collections.synchronizedSet
method.
It's nice that Java makes it easy to apply this synchronization wrapper, but (a) you have to remember to do so, and (b) once you do so, even non-modifying operations, which tend to be the most common, are synchronized. So from a performance perspective it will be tempting to try to do without synchronization whenever possible; but the price of omitting it when it is needed is very high. And threading exacerbates the quandary of whether to copy a collection before returning it from an interface; the possibility that the client will modify the returned collection in another thread, which would not just cause it to have the wrong contents but could break it completely, provides additional motivation to copy it.
Functional collections cure all these ills. They never need to be synchronized. They can be returned from interfaces without copying and without risk. And the additional performance cost of functional collections relative to imperative ones — consisting primarily of additional load on the garbage collector — is somewhat made up for by the elimination of copying and synchronization costs.
[Update: Java 1.6 now has ConcurrentSkipListSet
, which is
thread-safe without locking. That renders this particular point moot —
for those using Java 1.6. For Lisp users, and those Java users who haven't
yet switched to 1.6, it still applies. I can't resist pointing out, though,
that the Java designers obviously didn't intend
ConcurrentSkipListSet
for widespread use, or they would have
given it a shorter name.]
(While this issue doesn't apply to Lisp, I think it's worth mentioning
concerning functional collections generally.) The addition of generic types
to Java raised a conundrum. One would think, given a class A
with a subclass B
, that Set<B>
would be a
subclass of Set<A>
. But with mutable collections, this
obvious and highly desirable rule is problematic. Consider:
Set<B> bs; A some_a; Set<A> as = bs; // here's the problem as.add(some_a);
Once the assignment on the commented line has been made, it becomes
possible, as on the last line, to add an element to the set that violates its
original type — the same set is, after all, stil accessible as
bs
. So subtyping cannot extend through generic types.
Functional collections cure this problem, making it possible for, e.g.,
PureSet<B>
to be treated as a subclass of
PureSet<A>
(PureSet
is the FSet/Java
functional set interface). All we need is a way to specify, in the definition
of the generic class, that the subtype relationship applies. Alas, Java,
unlike Scala, doesn't
have a way to do that. I hope one will be added soon.
Do you think it's strange that I would argue that functional collections are the superior choice for general use? I think it's strange that this opinion is not already the conventional wisdom. Consider this: as I mentioned above, both Lisp and Java have functional infinite-precision integers. Why are these functional rather than mutable? One could make these arguments:
Someone with the opposite opinion could argue thus in counterpoint:
With the exception of the first point — consistency with the natural behavior of machine-word integers — these arguments are all identical, on both sides, to those that apply to functional collections. And as far as that first point goes, if there was ever a property that could be called merely an accident of implementation, surely this is such a property! Do you really want to argue that, of two mathematical concepts, integers and sets, our implementations should differ in design and semantics in such a deep way just because some integers fit in a machine word? In fact, certain restricted-domain sets can be implemented as bitfields, and therefore sometimes also fit in a machine word!
Now to this argument that people used to imperative programming find
functional semantics unnatural. Of course this is true, but it's also
circular; the more that people are exposed to functional programming, the
more natural they will find it. (I understand that it's not uncommon for
inexperienced Java programmers to think that, for instance, the
BigInteger.add
method modifies the instance on which it's
invoked. I can't resist pointing out that this is partly because the method
is badly misnamed. Names of functional methods like these should be
prepositions (plus
, with
) or nouns
(union
, difference
); they should never be verbs
like add
. It's easy to see how a casual or inexperienced coder
might think that a.add(b)
would modify a
;
a.plus(b)
is more likely to suggest a + b
.)
So if you're willing to pay the GC cost for bignums or
BigInteger
s, I don't see how you can argue that the GC cost of
functional collections is too high. Sure, there may be occasional situations
where profiling reveals that it would be worth recoding a
performance-sensitive algorithm to use imperative collections; and you always
have the option to do that. But for general use, the default choice —
the one to use unless you have a specific reason to do otherwise —
should clearly be functional.
Small updates to FSet collections — insertions and deletions of single elements, for instance — take only "log time", i.e., time logarithmic in the size of the collection, because in a functional binary tree, only the particular leaf being modified, plus its ancestors up to the root, need to have new versions created; the rest of the tree is shared between the old version of the collection and the new one.
Here is a list of each FSet operation which has the given time complexity.
Also, the set and bag union, intersection, difference, and subset/subbag test operations, along with map merge and all the collection compare operations, implement an optimization whereby if the operands are historically related collections — one was formed by doing a small number of single-element insertion or deletion operations on the other one, or they are both related in this way to a third set — these operations can run in log time rather than linear time.
I don't really know where the idea of functional collections originated. I got it from the little-known language called Refine, originally developed at Kestrel Institute in the mid-1980s. But I recently came across some documentation for an older language called SETL, and it seems pretty clear that Refine was heavily influenced by SETL in this area. But I don't know whether SETL was the first language with functional collections.
Anyway, I worked in Refine for years, mostly doing software reengineering and static analysis. It was this experience that convinced me of the stylistic benefits of fully nestable functional collections. Refine's implementation of these, however, used unordered lists, which is great for very small collections and terrible for large ones. I wanted a Common Lisp implementation with much better time complexity properties. So I wrote FSet.