How toy problems are different

A lot of simple text-processing problems involve operating on the words of the input. Norvig's spellchecker, for instance, begins by extracting the words:

def words(text): return re.findall('[a-z]+', text.lower())

So do some of the old Shootout problems (though none of the currently active ones, and they don't keep old ones around, so if you want to see an example you'll have to use the WayBackMachine). This sets my library-senses tingling. If so many programs need to split text into words, shouldn't words be in the standard library?

There are two problems here. The lesser is that the definition of "word" varies a bit - sometimes it's defined by alphanumeric characters, sometimes by whitespace. This isn't a showstopper, because some reasonable definition will probably work for a large fraction of the potential uses. The greater problem is that splitting things into words is much less common in real programs than in toy examples. Words are like Fibonacci numbers: they're familiar and easy to ask for, so they occur unusually often in made-up requirements. Neither is common in real programs - probably not enough so to warrant including them in a language's library.

This is one of the pitfalls of testing a language against toy problems. Unlike real problems, they're strongly biased toward simplicity of requirements, which affects what they demand from a language. Real programs tend to spend a lot of code on I/O and error handling, neither of which turns up much in toy problems. Is it a coincidence that these two areas are awkward to express in most languages?

Where have all the pointers gone?

Dynamically typed languages represent almost everything as tagged pointers, which sounds very inefficient: doesn't that waste a lot of space? And doesn't 64-bit addressing make this waste much, much worse?

I used to take this seriously. But one day a few years ago I was playing with Squeak, and wondering why its images (savable worlds) were so large - 10 MB, when the original Smalltalk ran in 64 KB. I stumbled across allObjectsDo:, which iterates across all objects in the heap, so I wrote a heap profiler to find out where the space was going. Once I finished it, I discovered Squeak came with a much better one, so I used that instead. Here are all the classes that take more than 1% of the space:

ClassPurpose# InstancesSpace%
CompiledMethodbytecode339512.0 MB20%
Bitmapgraphics12081.8 MB17%
Stringtext287431.7 MB16%
Arrayvarious302631.3 MB12%
ByteArrayvarious379800 KB8%
Symbolselectors mostly27786550 KB5%
MethodDictionarymethod dispatch3092320 KB3%
WeakArraysymbol table mostly15230 KB2.2%
PointGUI17193200 KB1.9%
MorphExtensionGUI3344170 KB1.6%
ColorGUI8323163 KB1.6%
Associationhashtables10761126 KB1.2%
Floatboxed float9732114 KB1.1%
All others1.1 MB10%
Classes without pointers>6.6 MB>65%
Total10.2 MB100%

This shows why tagged pointers are not a big space problem: there just aren't that many of them. Five of the top six classes, accounting for two thirds of the space, are binary data. (The exception is Array.) Dynamically typed languages, it turns out, don't encode most data as tagged pointers.

Well, of course. What do you do with a large data structure? You find a more compact representation. Often that means replacing low-entropy pointers with high-entropy bytes. That's what happened to all of those five non-pointer classes. Dynamic typing makes no difference: as long as you have some sort of byte arrays (in Smalltalk, Class»variableByteSubclass:), you can do this in dynamically typed languages the same way you do it in statically typed ones. If tagged pointers are wasteful, the waste is in boxing small objects like floats, not in the presence of pointers.

Of course this is only one image, and an odd one at that - instead of application-specific data, it's dominated by library code. The reason Squeak is so much bigger than the old Smalltalks is that it has a much bigger standard library - and no autoloading, so it's all in the image all the time. Code accounts for at least 30% of the space: CompiledMethod, Symbol, MethodDictionary, and WeakArray (most of which is the symbol table). This 3 MB would be insignificant in a bigger image, but otherwise I expect larger applications to have space profiles much like Squeak's. Most classes are made of pointers, but most of the space goes on the few that aren't.