You can find a link to my book on programming language architecture here, and I can't resist a link to a blog by Ashley Feniello, who actually liked the book :-).
I guess this page is intended as a sort of addendum to the book as it covers topics I never got a chance to discuss, but it's also a lot more haphazard and random, sort of a sounding board for those topics in case I ever get around to incorporating them.
I should also point out that a lot of this is me explaining the ideas to myself. I reserve the right to be wrong.
Recently I've been thinking about how best to describe a really simple language so that the reader can just "get it" in one go. As you probably know I'm quite fond of UML and this is what I've come up with:
I'd hope that this diagram stands alone, but for the impatient I'll describe it.
You can see that the implementation breaks down into three primary groups of classes: Syntax structures, Environment and Operations. The environment is the simplest and the best thing to look at first. You can see an abstract base class Env with concrete Frame and Empty subclasses. The Env class provides one concrete method, extend(), which takes a dictionary (hash, or whatever) as argument and creates and returns a new Frame containing that dictionary and a parent pointer to the original environment. The Frame and Empty subclasses supply a lookup() method which, for the empty env, just errors. For a non-empty env it either looks up the symbol in the dictionary or calls lookup() on its parent, passing it the symbol.
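For the impatient, here's roughly what those three classes might look like, in the same Java-flavoured pseudocode used further down this page (Dictionary and error() are just stand-ins):

abstract class Env {
    // wrap this env in a new Frame containing the given bindings
    Env extend(Dictionary<Symbol, Object> dict) {
        return new Frame(dict, this);
    }
    abstract Object lookup(Symbol s);
}

class Frame extends Env {
    private Dictionary<Symbol, Object> dict;
    private Env parent;

    Frame(Dictionary<Symbol, Object> dict, Env parent) {
        this.dict = dict;
        this.parent = parent;
    }

    Object lookup(Symbol s) {
        if (dict.contains(s)) {
            return dict.get(s);
        } else {
            return parent.lookup(s);   // defer to the enclosing environment
        }
    }
}

class Empty extends Env {
    Object lookup(Symbol s) {
        error("symbol %s not found", s.name);
    }
}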
The Syntax group of classes all inherit from a common abstract
Exp base class that declares one abstract method: eval(env),
that they all must implement. A Constant evaluates to itself.
A Symbol looks itself up in the argument environment. A
Conditional evaluates the test then either evaluates the
consequent or the alternative. Function declarations evaluate
by creating a new Closure object passing it the body and
formal arguments to the function, along with the current environment.
Application of an operation involves evaluating the operation expression in the current environment, evaluating the argument expressions in the current environment, then calling the resulting operation's apply() method on the resulting evaluated arguments. Note that in operands.map(eval(env)), eval is sort of Curried: the argument env is explicit, but the map supplies the objects whose eval methods are to be called.
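A rough sketch of those eval() methods, still in pseudocode (the class and field names are mine, chosen to match the description, and isTrue() is a stand-in):

abstract class Exp {
    abstract Object eval(Env env);
}

class Constant extends Exp {
    Object value;
    Object eval(Env env) { return value; }              // evaluates to itself
}

class Symbol extends Exp {
    String name;
    Object eval(Env env) { return env.lookup(this); }   // looks itself up
}

class Conditional extends Exp {
    Exp test, consequent, alternative;
    Object eval(Env env) {
        return isTrue(test.eval(env))
            ? consequent.eval(env)
            : alternative.eval(env);
    }
}

class FunctionDeclaration extends Exp {
    List<Symbol> fargs;
    Exp body;
    Object eval(Env env) {
        return new Closure(fargs, body, env);            // captures the current environment
    }
}

class Application extends Exp {
    Exp operation;
    List<Exp> operands;
    Object eval(Env env) {
        Op op = (Op) operation.eval(env);
        List<Object> args = operands.map(eval(env));     // env explicit, map supplies the receivers
        return op.apply(args);
    }
}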
Lastly, Operations themselves share an abstract Op base class with an abstract apply() method. An Op is either a Primitive or a Closure. The implementation of the primitive apply() is just an implementation detail, however the implementation of the Closure's apply() ties the whole thing up:
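Something like this (a sketch in the same pseudocode; makeDictionary() is assumed to pair up formal and actual arguments):

abstract class Op {
    abstract Object apply(List<Object> args);
}

class Primitive extends Op {
    // apply() here is just an implementation detail (built-in addition, etc.)
}

class Closure extends Op {
    List<Symbol> fargs;
    Exp body;
    Env env;

    Object apply(List<Object> args) {
        // bind formal to actual arguments, extend the captured environment,
        // and evaluate the body in that extended environment
        return body.eval(env.extend(makeDictionary(fargs, args)));
    }
}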
That's about as simple as I can make it.
Another recent idea I've had is to build a compiler for a PScheme-like language (different syntax, extended functionality) that would target an abstract machine much as Java does (I know Java calls it a "Virtual Machine" but that's an accident of history, since Java originally targeted a real CPU and the replacement machine was a "virtualisation" of that.) I started looking at Parrot, but Parrot has built-in support for a single continuation only, and would make a second "failure" continuation doubly difficult to implement.
So I started thinking about building my own. The real joy of this is that I can prototype it in a high-level language like Perl, then re-write it into something fast like C or C++ later. I actually have a (mostly) working version in Perl for PScheme that runs about seven times faster than the interpreter, but it has issues.
Anyway I think the basic ideas are sound. In an interpreter a closure is a function plus an environment. In a compiled language that's even simpler, a "function" is just a machine address, so a closure is a pair of addresses: address of environment on the heap and address of machine instruction in the core. A Success Continuation on the other hand has a lot more context. It contains the return address and the old environment, but also must restore any temporary variables, along with the previous continuation that will be returned to later. Finally a Failure continuation is a complete reset. It does not preserve any data when it is invoked, but completely restores the Abstract machine to a previous state.
So without further ado, here's my abstract machine:
Usefully, the different subcomponents (Closure, Continuation and
Failure) are all contiguous, so invoking any of them is just a
single memcpy()
from the stored object (on the heap)
to the "cpu". I don't know if this is just blind luck or something
deeper.
For each subcomponent there would be two abstract machine instructions: one to set up a new structure and one to invoke it. Specifically:
create_closure(PC, ENV)
    creates a new closure from the given code address and environment.

call_closure()
    is effectively memcpy(&AM, AM.TEMPS[n], sizeof(struct closure)).

create_continuation(PC)
    saves the current machine state in AM.CONT so that execution can later resume at the new continuation.

continue()
    is effectively memcpy(&AM, AM.CONT, sizeof(struct continuation)).
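To make that a little more concrete, here's a very rough model of the machine in the same pseudocode; the record layouts and register names are invented purely for illustration, not taken from any actual implementation:

// stored records on the heap
class Closure      { int pc; Env env; }
class Continuation { int pc; Env env; Object[] temps; Continuation prev; }
class Failure      { int pc; Env env; Object[] temps; Continuation cont; Failure prev; }

class Machine {
    // the "cpu" registers
    int PC; Env ENV; Object[] TEMPS;
    Continuation CONT; Failure FAIL;

    // call_closure(): the memcpy of a stored closure into the "cpu"
    void callClosure(Closure c)   { PC = c.pc; ENV = c.env; }

    // continue(): restore everything a success continuation saved
    void invoke(Continuation k)   { PC = k.pc; ENV = k.env; TEMPS = k.temps; CONT = k.prev; }

    // fail: a complete reset of the machine to a previously stored state
    void invoke(Failure f)        { PC = f.pc; ENV = f.env; TEMPS = f.temps; CONT = f.cont; FAIL = f.prev; }
}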
So the plan is:
So far, I have a parser and a partial interpreter for a new language called "F-Natural": if you think of ML but with real Unification, backtracking and a javascript-like syntax you will get the basic idea. I'm still a long way from world domination but we live in hope. I may post some code here later.
Rather than a single "Environment" type, a compiler distinguishes between compile-time and run-time environments.
In an interpreter, the value of a symbol is the result of looking it up in the current environment. That lookup process involves traversing the environment, frame by frame, until the symbol is found and the associated value is retrieved. The interpreter's environment is constructed at run-time by closures extending a previous environment with bindings of formal to actual arguments.
A compiler can't know the values of variables at compile time. It can, however, determine the location those variables will have in the environment at run-time. A compiler will perform a lexical analysis of the code, analogous to evaluation. It passes around, and (when entering a closure) extends, a compile-time environment that contains only the symbols, not their values. When it comes across a variable access it looks up the symbol in the compile-time environment and replaces the variable with its address in the environment (i.e. three frames back, two variables in).
At run time that data can rapidly be retrieved from the run-time environment (which contains only the data) without any laborious traversal and comparison.
That gives us three classes and a dumb data object for the location (all of this is pseudocode)
class CTE {
    private CTE parent;
    private Symbol[] symbols;

    Location lookup(Symbol s, int framecount=0) {
        if (symbols.contains(s)) {
            return new Location(framecount, symbols.positionOf(s));
        } else {
            return parent.lookup(s, framecount + 1);
        }
    }
}

class RTE {
    private RTE parent;
    private Datum[] data;

    Datum lookup(Location l) {
        return this._lookup(l.framecount, l.pos);
    }

    Datum _lookup(int framecount, int pos) {
        if (framecount > 0) {
            return parent._lookup(framecount - 1, pos);
        } else {
            return data[pos];
        }
    }
}

class EmptyCTE extends CTE {
    Location lookup(Symbol s, int framecount=0) {
        error("symbol %s not found", s.name);
    }
}
Extenders for the CTE and RTE take arrays of symbols and values respectively; otherwise they are just the same as the interpreter's extend() methods.
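For instance (a sketch only, assuming the obvious two-argument constructors for CTE and RTE):

class CTE {
    // ...
    CTE extend(Symbol[] symbols) {
        // push a new compile-time frame containing just the symbols
        return new CTE(symbols, this);
    }
}

class RTE {
    // ...
    RTE extend(Datum[] data) {
        // push a new run-time frame containing just the values
        return new RTE(data, this);
    }
}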
Here's a different take on environments. In a pure functional language
all data is immutable, end of story, and that would apply to any
environments implemented in such a paradigm too.
Operators like define would not be allowed to modify an environment; instead such operators would have to return a copy of the environment with the modified value. This has some interesting implications for any language using such an environment. For example:
define PI = 3.14159265;
fn printPI { print(PI); }
define PI = 4;
fn printNewPI { print(PI); }
printPI();     // prints 3.14159265
printNewPI();  // prints 4
So each fn captures the current environment, and since nothing is allowed to change, the environment captured is precisely as it was when it was captured.
So how can we make a functional environment that is even remotely efficient? A standard technique is to implement it as a binary tree. Let's assume symbols have unique numeric identifiers, so that symbol lookup can be reduced to numeric lookup. Then our environment can be a tree whose left descendants all have indexes less than the current node, and whose right descendants all have indexes greater than the current node. Here's a pair of classes that implement such a tree:
abstract class Env {}

class Node extends Env {
    Env left;
    Env right;
    int index;
    Datum value;

    Datum lookup(int ix) {
        if (ix < index) {
            return left.lookup(ix);
        } else if (ix > index) {
            return right.lookup(ix);
        } else {
            return value;
        }
    }
}

class EmptyNode extends Env {
    Datum lookup(int ix) {
        error(...);
    }
}
So far so good, but what about extension?
class Node extends Env {
    // ...
    Env extend(int ix, Datum val) {
        if (ix < index) {
            return new Node(left.extend(ix, val), right, index, value);
        } else if (ix > index) {
            return new Node(left, right.extend(ix, val), index, value);
        } else {
            return new Node(left, right, ix, val);
        }
    }
}

class EmptyNode extends Env {
    // ...
    Env extend(int ix, Datum val) {
        return new Node(this, this, ix, val);
    }
}
extend() only makes a copy of the path through the tree that was traversed in order to find the node to add or replace. This means that for a balanced tree, extend() is O(log n) in the number of items in the environment, which is not too bad.
Note also how the EmptyNode
behaves as a singleton without explicitly
requiring a static instance
variable. This makes the implementation
quite space efficient too.
The implementation does not support deletion of values, but that is ok since only a non-functional language with backtracking would require it, in order to undo side-effects.
Lastly, note that because these environments are immutable, we do not need any notion of an environment frame to encapsulate local changes; the old environment is still present and will be reverted to outside of a closure, just as if we were using environment frames.
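As a quick illustration of that point (just a sketch: 42 is an arbitrary symbol index, oldPI and newPI are whatever Datum values wrap 3.14159265 and 4, and Env is assumed to declare lookup() and extend() as the concrete classes suggest):

Env e1 = new EmptyNode().extend(42, oldPI);   // bind index 42 to 3.14159265
Env e2 = e1.extend(42, newPI);                // a new environment with index 42 bound to 4
e1.lookup(42);                                // still 3.14159265: e1 is untouched
e2.lookup(42);                                // 4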
Of course this completely ignores the issues of maintaining balanced binary trees, for the sake of simplicity.
In my book I briefly mention (in a footnote I think) the Lambda Calculus. It's probably time to expound on that. As I said in the book, the Lambda Calculus is a mathematical discipline that predates computers, but is amazingly relevant to computer language design. Basically the Lambda Calculus assumes that there are nothing but functions, each taking only one argument, and it derives all possible computations from them.
I won't bother describing the specific syntax of the LC right now, rather let's take a look at a translation of some of the LC into Perl.
First off, functions can only take one argument. So how do we imitate functions, such as addition, that require more than one argument? The answer is a technique called "Currying" (after the mathematician Haskell Curry who gave the better part of his name to a programming language.) To summarise, a "Curried" function of two arguments is a function that takes one argument and returns a function that takes a second argument and returns the result. So for example Currying a Perl function add:
my $add = sub {
    my ($a, $b) = @_;
    $a + $b;
};
(using anonymous subs throughout for consistency) produces:
my $add = sub {
    my ($a) = @_;
    sub {
        my ($b) = @_;
        $a + $b;
    }
};
which you can call like:
$add->(3)->(4);
to get 7.
More interestingly, we can represent truth and falsehood as functions. Assuming implicit Currying from now on, true can be represented as a function that takes two arguments and returns its first one:
my $true = sub {
    my ($a, $b) = @_;
    $a;
};
and false similarily as a function that returns its second argument:
my $false = sub {
    my ($a, $b) = @_;
    $b;
};
Now we can represent the conditional if as a function of three arguments:
my $if = sub {
    my ($test, $consequent, $alternative) = @_;
    $test->($consequent, $alternative);
};
It just applies the truth value to the consequent and alternative arguments, so if is just syntactic sugar.
What about data structures? The simplest composite data structure is a pair, and we can do that with functions too:
my $cons = sub {
    my ($car, $cdr) = @_;
    sub {
        my ($selector) = @_;
        $if->($selector, $car, $cdr);
    }
};

my $car = sub {
    my ($pair) = @_;
    $pair->($true);
};

my $cdr = sub {
    my ($pair) = @_;
    $pair->($false);
};
so a pair is a function that takes a truth value and returns the car if it is true and the cdr if it is false, and car and cdr are functions that pass the appropriate truth value to the pair.
(I need to talk here about Church Numerals, when I can figure them out.)
Currying all of the above is left as an exercise for the reader.
So what does the Lambda Calculus actually look like? Well, it's very terse. Functions are created by λ.x body where x is the argument, and are called like f a where f is the function and a the argument. Of course you can use brackets to disambiguate, but that's it, nothing more to it. Here's my attempt at a definition for if:
λ.t (λ.c (λ.a (tc)a))
I hope I got that right: a function taking argument t (test) returning a function taking argument c (consequent) returning a function taking argument a (alternative), which applies t to c, and applies the result to a (remember t is Curried). Phew!
Currying is quite a useful thing to be able to do given support from a language, but it can be difficult to read. For F♮ I hit upon the idea of using a trailing comma to indicate a partial function call:
fn add(a,b) { a + b; }
def add2 = add(2,);
add2(3); // 5
Which provides a visual clue to the reader that add()
in this example is not being called with all of its arguments and
therefore returns a function that expects the rest. Perhaps more
practically, something like this is more visually appealing, once
you get what the trailing comma means:
map(eval(env,), lst);
where we map eval
with a supplied first argument
env
over the items of lst
. This avoids
having to create a closure:
map( fn (l) { eval(env, l) }, lst);
One of the defining features of a high-level language like Scheme, Perl, PHP or almost any other recent language is that they have built-in garbage collection (GC), which makes the programmer's life much easier because they don't have to worry about memory management too much. However there are different GC strategies, and costs and benefits depending on the choices you make. I'd like to talk a little about the common GC strategies here, so you can see what the trade-offs are.
Unfortunately for both Perl and PHP they made a bad choice of garbage collector; they both opted for the simplest possible mechanism, that is reference counting. Why it is a bad choice I hope to demonstrate.
Reference counting is simple. Each object being memory-managed has a reference-count field which is incremented whenever a new variable refers to the object, and decremented when a variable stops referring to that object. If the reference count reaches zero, nothing can be referring to that object and it can be immediately returned to the free store, decrementing the reference counts of any objects it may contain in turn.
Now this is initially quite attractive: one appealing feature, beyond its simplicity, is that garbage collection happens in tiny little steps and does not usually interfere with the flow of the program, in contrast to the Mark-and-Sweep GC discussed below. The real problem with it is that it doesn't work, at least not in all cases.
The cases where it doesn't work are where there are circular references between structures. If we imagine a situation where A refers to B, B to C and C back to A, and we have an additional reference X also referring to A, and nothing else, then A will have a reference count of 2 (one from X, one from C), and B and C a reference count of 1. Now when X stops referring to A, A's reference count will be decremented to 1, but now that entire cyclic structure of A, B and C is cut adrift: since nothing external refers to it, no reference counts can ever be decremented again, and A B and C all still have reference counts of 1.
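A sketch of that situation, assuming a hand-rolled reference count (the class, field and method names here are mine, not Perl's or PHP's internals):

class RC {
    int refcount = 0;
    RC next;                 // one outgoing reference is enough to build A -> B -> C -> A

    static RC acquire(RC o) { if (o != null) o.refcount++; return o; }

    static void release(RC o) {
        if (o == null) return;
        if (--o.refcount == 0) {
            release(o.next); // recursively drop anything this object referred to
            // ... return o's memory to the free store
        }
    }
}

// RC a = RC.acquire(new RC());        // X refers to A
// a.next = RC.acquire(new RC());      // A refers to B
// a.next.next = RC.acquire(new RC()); // B refers to C
// a.next.next.next = RC.acquire(a);   // C refers back to A: A's count is now 2
// RC.release(a);                      // X drops A: A's count falls to 1, nothing is
//                                     // freed, and the whole cycle is leaked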
There are some fancy tricks in Perl to try to work around the problem, specifically weak references, which are references that do not disturb the reference counts, but they are probably more difficult to reason about than simple memory management in low-level languages and so are not a valid solution. Another solution is to have a "guard" object acting as a wrapper around the cyclic structure. If the guard object's reference count goes to zero its DESTROY method will be invoked, and it then explicitly tells the cyclic structure to remove the circular references so that the components can be recovered. Again this is making memory management the programmer's concern.
This is a strategy that at least works in all cases, but has significant drawbacks. With mark and sweep, a "free list" is maintained of objects whose space is available for reuse. A normal request for memory will retrieve an object from the free list and put it on to an "in use" list. If there are no objects available on the free list, then Mark and Sweep garbage collection is invoked. Mark and Sweep, as its name suggests, proceeds in two stages. The first is to follow all pointers from the current execution context, recursively, and to "mark" all objects found as "in use". The next phase, the "sweep" phase, traverses the "in use" list and moves any objects not marked as in use to the free list (also resetting the "mark" on all objects).
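A minimal sketch of that scheme, assuming every managed object carries a mark flag and a list of the objects it refers to (names are mine):

import java.util.*;

class MSObject {
    boolean marked = false;
    List<MSObject> references = new ArrayList<>();    // the objects this one points at
}

class MarkSweepHeap {
    List<MSObject> inUse = new ArrayList<>();
    List<MSObject> free  = new ArrayList<>();

    void mark(MSObject o) {
        if (o.marked) return;
        o.marked = true;                              // reachable from the execution context: "in use"
        for (MSObject child : o.references) mark(child);
    }

    void collect(List<MSObject> roots) {
        for (MSObject r : roots) mark(r);             // mark phase
        List<MSObject> survivors = new ArrayList<>();
        for (MSObject o : inUse) {                    // sweep phase
            if (o.marked) { o.marked = false; survivors.add(o); }
            else free.add(o);                         // unmarked objects go back to the free list
        }
        inUse = survivors;
    }
}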
The problem is that in a typical application the majority of the
objects on the "In Use" list will not be actually in use, and the
sweep phase will use a lot of resources moving unused objects back
to the free list. This can produce a pronounced "stutter" in
interactive programs, where they appear to hang for seconds at a
time. This behaviour was a common failing of early Lisp implementations,
and one solution there was to provide a (gc)
command
that the programmer could sprinkle around the code in order that
the number of unused objects on the used list never got too big.
Again this is passing memory management back to the programmer,
albeit at a higher level.
The realisation that most supposedly in-use objects actually are not in use is a clue to a more time efficient (but not space-efficient) garbage collection strategy.
To get it to work we have to drop the high-level notion of lists, and instead get down to the machine level and consider free pools and in-use pools, where a pool is just a contiguous region of memory on the heap.
Basically, instead of moving everything that isn't in use from
the in-use pool to the free pool, we move what is in use, as we
find it during the mark phase, to the "free pool" then we swap the
pointers to the two pools: so the free pool becomes the in-use pool
and vice versa. We know that the free pool was initially empty
otherwise garbage collection would not have been invoked. Of course
that means we can no longer have an explicit (gc)
command, but that's not a bad thing. It also means we can dispose
of the Sweep phase altogether.
The remaining details are fairly simple: when moving an object from the in-use pool to the free pool, we must leave a note behind to say that it has been moved, and where it has been moved to. We replace the old copy of the object with a "forwarding pointer" with a flag to say that it has been forwarded. That way any further encounters of the object during garbage collection merely update their pointers to refer to the new copy and then stop, because the object and its contents have already been moved.
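Here's a rough model of those details; it uses lists of objects rather than real contiguous memory, and the names are my own, but the forwarding-pointer logic is the one described above:

import java.util.*;

class Obj {
    Obj[] children = new Obj[0];   // the objects this one points at
    Obj forward;                   // non-null once the object has been moved
}

class CopyingCollector {
    List<Obj> inUsePool = new ArrayList<>();   // where live objects currently sit
    List<Obj> freePool  = new ArrayList<>();   // empty until a collection starts

    Obj move(Obj o) {
        if (o == null) return null;
        if (o.forward != null) return o.forward;     // already moved: just update the pointer
        Obj copy = new Obj();
        copy.children = new Obj[o.children.length];
        o.forward = copy;                             // leave a forwarding pointer behind
        freePool.add(copy);
        for (int i = 0; i < o.children.length; i++) {
            copy.children[i] = move(o.children[i]);   // move everything reachable from here
        }
        return copy;
    }

    void collect(Obj[] roots) {
        for (int i = 0; i < roots.length; i++) {
            roots[i] = move(roots[i]);                // the "mark" phase does the moving
        }
        // swap the pools: the free pool becomes the in-use pool and vice versa
        List<Obj> tmp = inUsePool;
        inUsePool = freePool;
        freePool = tmp;
        freePool.clear();                             // everything left behind was garbage
    }
}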
Another observation about the memory usage of typical applications provides a more time efficient variation on Copying Garbage Collection. That observation is "the longer an object has persisted, the longer it is likely to persist".
To leverage that observation, we divide our heaps into "generations". The most recent generation being the objects that have not yet been through a garbage collection. Generational Garbage collection inspects the most recent generation of in-use objects first, and if it can reclaim enough space there its job is done. It moves objects still in use to the next generation, and so on (the choice of the number of generations is presumably configurable, and the oldest generation falls back to standard Copying Garbage collection behaviour.)
More details needed, later...
I've gone back to F♮, but decided to abandon the Perl implementation in favour of Java. I might be able to target the Java VM, but I'm not sure yet. In any case it's a good chance to brush up my Java and learn IntelliJ IDEA (CE). I'm using SableCC to parse, and I've just got Log4j working. I'm long overdue writing some unit tests but for the moment, after a week or so, I have a working interpreter and the following sample script that runs as advertised:
// "env foo { ... }" is shorthand for // "def foo = env { ... };". // "env { ... }" extends the current environment, // evaluates the block in // the extended environment, then returns // the extended environment env math { // "fn bar (...) {...}" is shorthand for // "def bar = fn(...) {...};" fn square(x) { x * x } fn adder(x) { fn (y) { x + y } // demonstrates closure // and anonymous functions } env highermath { fn cube(x) { x * square(x) } // finds square in } // the enclosing environment } env badmath { fn square(x) { x + 1; } env highermath { fn square(x) { x * badmath.square(x) } // finds badmath in } // the root environment } def a = 10; // "." is the lookup operator: // lhs is an expression evaluating to an environment, // rhs is a symbol to be looked up: print( math.square(a) ); // 100 // "if/else" is an expression returning a value, // not a statement // so there is no need for the // ternary "?:" operator. // here "if" returns a function to call: print( if (a == 5) { math.square } else { math.highermath.cube }(a) ); // 1000 // here "if" returns an environment in which // we look up a function: print( if (a == 5) { math } else { badmath.highermath }.square(a) ); // 110 // call a function to get a function // then call that: print(math.adder(5)(22)); // 27 // a string is a linked list of chars // chars are written 'x' // strings are written "abc" // "@" is cons (right associative) // "@@" is append (right associative) // linked lists can also be written between '[' and ']' print( 'a' @ 'b' @ "cd" @@ "ef" @@ ['g', 'h'] ); // "abcdefgh"
Apart from the env directive, and the fact that strings are lists of chars, this is still very much a Scheme interpreter with a more syntax-laden parser in front of it, but it's early days. Next steps:
typedef List(t) { Pair(t, List(t)) | Null }

(except that List is already predefined.) t is a type variable, so this says a List of some type t is either a Pair of a t and a List of t, or it is Null. Pair and Null become type constructors for type List(t), so Pair('c', Null) creates a list of length 1 with type List(char). Like I said, we already have lists, and h @ t is (cons h t), e.g. Pair(h, t).
fn map {
    (f, []) { [] }
    (f, h @ t) { f(h) @ map(f, t) }
}

so the formal arguments to the function can optionally be moved inside the function body, and repeated with alternative formats, like a case statement. The format that unifies with the actual arguments has its body evaluated with the relevant variables bound.
then:

def name = "Fred" then "Charlie";
fail as a keyword of no arguments:

fn require(x) { if (x) { true } else { fail } }

(That last one might give the type checker a headache.)
This was the bit of Comp. Sci. I always thought looked uninteresting, but in fact when you delve into it it's really fascinating and dynamic.
What we're actually talking about here is implicit, strong
type checking. Implicit
means there is not (usually) any need to
declare the type of a variable or function, and strong
means that
there is no possibility of a run-time type error (so there is no need
for run-time type checking.)
Take a look at the following code snippet:
fn double(x) { x + x }
def y = 10;
def z = double(y);
You and I can both infer a lot of information about
double
, x
, y
and z
from that piece of code, if we just assume the normal meaning for
+
. If we assume that +
is an operator on
two integers, then we can infer that x
is an integer, and therefore the argument to double
must be an
integer, and therefore y
must be an integer. Likewise
since +
returns an integer, double
must
return an integer, and therefore z
must be an integer
(in most languages +
etc. are overloaded to operate on either
ints or floats, but we'll ignore that for now.)
Before diving in to how we might implement our intuition, we need
a formal way of describing types. For simple types like integers we
can just say int
, but functions and operators are just
a bit more tricky. All we're interested in are the argument types
and the return type, so we can describe the type of +
as:
(int, int) -> int
I'm flying in the face of convention here, as most text books
would write that as (int * int) -> int
. No, that
*
isn't a typo, it is meant to be some
cartesian operator for tuples of types, but I think it's just
confusing so I'll stick with commas.
To pursue a more complex example, let's take that adder
function from a previous post:
fn adder(x) { fn(y) { x + y } }
So adder
is a function that takes an integer
x
and returns another function that takes an integer
y
and adds x
to it, returning the
result. We can infer that x
and y
are
integers because of +
just like above.
We're only interested for the moment in the formal type of
adder
, which we can write as:
int -> int -> int
We'll adopt the convention that ->
is
right associative, so we don't need parentheses.
Now for something much more tricky, the famous map
function. Here it is again in F♮:
fn map { (f, []) { [] } (f, h @ t) { f(h) @ map(f, t) } }
map
takes a function and a list, applies the
function to each element of the list, and returns a list of the
results. Let's assume as an example, that the function being passed
to map
is some sort of strlen
.
strlen
's type is obviously:
string -> int
so we can infer that in this case the argument list given to
map
must be a list of string
, and that
map
must return a list of int
:
(string -> int, [string]) -> [int]
(using [x] as shorthand for list of x).
But what about mapping square over a list of int? In that case map would seem to have a different type signature:
(int -> int, [int]) -> [int]
In fact, map doesn't care that much about the types of its argument function and list, or the type of its return list, as long as the function and the lists themselves agree. map is said to be polymorphic. To properly describe the type of map we need to introduce type variables which can stand for some unspecified type. Then we can describe map verbally as taking as arguments a function that takes some type a and produces some type b, and a list of a, producing a list of b. Formally this can be written:
(<a> -> <b>, [<a>]) -> [<b>]
where <a>
and <b>
are type
variables.
So, armed with our formalisms, how do we go about type checking the original example:
fn double(x) { x + x }
def y = 10;
def z = double(y);
Part of the trick is to emulate evaluation of the code, but only
a static evaluation (we don't follow function calls). Assume that
all we know initially is the type of +
. We set up a
global type environment, completely analogous to a normal
interpreter environment, but mapping symbols to their types rather
than to their values. So our type environment would look like:
{ '+' => (int, int) -> int }
On seeing the function declaration, before we even begin to
inspect the function body, we can add another entry to our global
environment, analogous to the def
of double
(we do this first in case the function is recursive):
{
  '+' => (int, int) -> int
  'double' => <a> -> <b>
}
Note that we are using type variables already, to stand for types we don't know yet. Now the second part of the trick is that these type variables are actually logic variables that can unify with other data.
As we descend into the body of the function, we do something
else analogous to evaluation: we extend the environment with
a binding for x
. But what do we bind x
to?
Well, we don't know the value of x
, but we do have
a placeholder for its type, namely the type variable <a>.
We have a tiny choice to make here. Either we bind x
to a new type variable and then unify that type variable with <a>
,
or we bind x
directly to <a>
. Since unifying two
unassigned logic variables makes them the same logic variable, the
outcome is the same:
{ 'x' => <a> }
{
  '+' => (int, int) -> int
  'double' => <a> -> <b>
}
With this extended type environment we descend into the body and
come across the application of +
to x
and x
.
Pursuing the analogy with evaluation further, we evaluate the
symbol x
to get <a>
. We know also that all operations
return a value, so we can create another type variable <c>
and build the structure (<a>, <a>) -> <c>
.
We can now look up the type of +
and unify
the two types:
(<a>, <a>) -> <c>
(int, int) -> int
In case you're not familiar with unification, we'll step through it.
Firstly <a>
gets the value int
:
(int, int) -> <c>
 ^
(int, int) -> int
Next, because <a>
is int
, the second comparison
succeeds:
(int, int) -> <c>
      v
(int, int) -> int
Finally, <c>
is also unified with int
:
(int, int) -> int
              ^
(int, int) -> int
So <a>
has taken on the value (and is now indistinguishable from)
int
. This means that our environment has changed:
{ 'x' => int }
{
  '+' => (int, int) -> int
  'double' => int -> <b>
}
Now we know <c>
(now int
) is the result type
of double
, so on leaving double
we
unify that with <b>
, and discard the extended environment.
Our global environment now contains:
{
  '+' => (int, int) -> int
  'double' => int -> int
}
We have inferred the type of double
!
Proceeding, we next encounter def y = 10;
.
That rather uninterestingly extends our global environment
to:
{
  '+' => (int, int) -> int
  'double' => int -> int
  'y' => int
}
Lastly we see the line def z = double(y);
.
Because of the def
we immediately extend our
environment with a binding of z
to a new
placeholder <d>
:
{
  '+' => (int, int) -> int
  'double' => int -> int
  'y' => int
  'z' => <d>
}
We see the form of a function application, so we look up the value of the arguments and result and create the structure:
int -> <d>
Next we look up the value of double
and unify the two:
int -> <d>
int -> int
<d>
gets the value int
and our job is done, the code
type checks successfully.
What if the types were wrong? Suppose the code had said
def y = "hello"
? That would have resulted in the
attempted unification:
string -> <d>
int -> int
That unification would fail and we would report a type error, without having even run the program!
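If it helps, here is a minimal sketch of the kind of unification involved, in Java. The class names are my own, function types take a single argument for simplicity, and the occurs check is omitted:

abstract class Type {
    Type prune() { return this; }
}

class TypeVar extends Type {
    Type instance;                       // null until this variable is unified with something
    Type prune() {
        if (instance == null) return this;
        instance = instance.prune();     // follow (and compress) chains of bindings
        return instance;
    }
}

class Prim extends Type {                // e.g. int, string
    final String name;
    Prim(String name) { this.name = name; }
}

class Fun extends Type {                 // a one-argument function type: arg -> result
    final Type arg, result;
    Fun(Type arg, Type result) { this.arg = arg; this.result = result; }
}

class Unify {
    static void unify(Type a, Type b) {
        a = a.prune();
        b = b.prune();
        if (a instanceof TypeVar) {
            if (a != b) ((TypeVar) a).instance = b;              // bind the variable
        } else if (b instanceof TypeVar) {
            unify(b, a);
        } else if (a instanceof Prim && b instanceof Prim) {
            if (!((Prim) a).name.equals(((Prim) b).name)) {
                throw new RuntimeException("type error: " +
                        ((Prim) a).name + " != " + ((Prim) b).name);
            }
        } else if (a instanceof Fun && b instanceof Fun) {
            unify(((Fun) a).arg, ((Fun) b).arg);                 // unify argument types
            unify(((Fun) a).result, ((Fun) b).result);           // unify result types
        } else {
            throw new RuntimeException("type error: cannot unify");
        }
    }
}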
The Hindley-Milner Algorithm is an algorithm for determining the types of expressions. Basically it's a formalisation of the above discussion. There's an article on Wikipedia which is frustratingly terse and mathematical. I don't mind the maths if it is explained clearly but that article fails to do so. This is my attempt to explain that article to myself, and to anyone else who may be interested.
We start by defining the expressions (e) we will be type-checking:
e ::= E        A primitive expression, i.e. 3.
    | s        A symbol.
    | λs.e     A function definition; s is the formal argument symbol and e is the function body (an expression).
    | (e e)    The application of an expression to an expression (a function call).
Next we define our types (τ):
τ ::= T        A primitive type, i.e. int.
    | (τ → τ)  A function of one argument, taking a type τ and returning a type τ (possibly different types).
We need a function:
[1]  f(ε, e) = τ

where:

    ε    a type environment
    e    an expression
    τ    a type
We assume we already have:
[2]  f(ε, E) = T    A set of mappings from primitive expressions to their types.

(from 3 to int, for example.)
The following equations are logic equations. They are easy enough to read: everything above the line is an assumption, and the statement below the line should follow if the assumptions are true.
Our second assumption is:
[3]  (s, τ) ∈ ε      If (s, τ) is in ε (i.e. if ε has a mapping from s to τ)
     -----------
     f(ε, s) = τ     then, in the context of ε, s is a τ.

Informally, symbols are looked up in the type environment.
[4]  f(ε, g) = τ1 → τ     If g is a function mapping a τ1 to a τ
     f(ε, e) = τ1         and e is a τ1
     ----------------
     f(ε, (g e)) = τ      then the application of g to e is a τ.
That is just common sense.
[5]  ε1 = ε ∪ (s, τ)               If ε1 is ε extended by (s, τ)
     -------------------------
     f(ε, λs.e) = τ → f(ε1, e)     then type the body in ε1.
This is just a bit tricky. We don’t necessarily know the value of τ (the type of s) when evaluating this expression, but that’s what logic variables are for.
We need two more operations: new, which returns a fresh type variable each time it is called, and eq, which unifies two equations. We modify our function, part [4] (function application), as follows:
[6]  τ0 = f(ε, e0)                        If τ0 is the type of e0
     τ1 = f(ε, e1)                        and τ1 is the type of e1
     τ  = new                             and τ is a fresh type variable
     ----------------------------------
     f(ε, (e0 e1)) = eq(τ0, τ1 → τ); τ    then, after unifying τ0 with τ1 → τ, the type of (e0 e1) is τ.
That deserves a bit of discussion. We know e0 is a function, so it must have a type τa → τb for some types τa and τb. We calculate τ0 as the provisional type of e0 and τ1 as the type of e1, then create a new type variable τ to hold the type of (e0 e1).
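Sketching rule [6] on top of the unification classes from the previous section (Expr, Apply and the other cases of infer() are assumptions of mine; only the application case is shown):

import java.util.Map;

abstract class Expr {}

class Apply extends Expr {
    Expr e0, e1;
}

class Inferencer {
    Type infer(Map<String, Type> env, Expr e) {
        if (e instanceof Apply) {
            Apply app = (Apply) e;
            Type t0 = infer(env, app.e0);       // τ0 = f(ε, e0)
            Type t1 = infer(env, app.e1);       // τ1 = f(ε, e1)
            Type t  = new TypeVar();            // τ  = new
            Unify.unify(t0, new Fun(t1, t));    // eq(τ0, τ1 → τ)
            return t;                           // the type of (e0 e1) is τ
        }
        // primitive expressions, symbols and λs.e follow rules [2], [3] and [5]
        throw new UnsupportedOperationException("not shown");
    }
}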
Suppose e0 is the function length
(the length of a
list of some unspecified type τ2), then τ0 should come
out as [τ2] → int
(using [x] to mean
list of x
.)
Suppose further that τ1 is char
.
We therefore unify:
[τ2]   → int
[char] → τ
Which correctly infers that the type of (length ['c']) is int. Unfortunately, in doing so, we permanently unify τ2 with char, forcing length to have permanent type [char] → int, so this algorithm does not cope with polymorphic functions such as length.
I've mentioned Currying and partial function application already. The idea is that given a function with more than one argument:
fn add (x, y) { x + y }
if we call it with less than the arguments it expects, then it will return a function that accepts the rest:
def add2 = add(2,);
add2(4); // 6
(The trailing comma is just a syntactic convention that I've come up with that lets the compiler know that we know what we are doing, and lets the reader know that there is Currying going on.) Now setting aside how we might type-check that, it turns out that it's actually pretty easy to evaluate.
Normal application of a closure looks something like this (Java-ish pseudocode):
class Closure {
    private List<Symbol> fargs;
    private Env env;
    private Expr body;

    Expr apply(List<Expr> actual_arguments) {
        Dictionary<Symbol, Expr> dict;
        List<Symbol> formal_arguments = this.fargs;
        while (formal_arguments.isNotNull() &&
               actual_arguments.isNotNull()) {
            dict.put(formal_arguments.head(),
                     actual_arguments.head());
            formal_arguments = formal_arguments.tail();
            actual_arguments = actual_arguments.tail();
        }
        return this.body.eval(this.env.extend(dict));
    }
}
For those of you that don't know Java, List<Symbol>
means List of Symbol
. And yes, we're ignoring the possibility
that we're passed the wrong number of arguments, the type checker should
deal with that.
Now if we are expecting that we might get fewer than the full set of arguments, we can instead create a new closure that expects the rest:
class Closure {
    private List<Symbol> fargs;
    private Env env;
    private Expr body;

    Expr apply(List<Expr> actual_arguments) {
        Dictionary<Symbol, Expr> dict;
        List<Symbol> formal_arguments = this.fargs;
        while (formal_arguments.isNotNull() &&
               actual_arguments.isNotNull()) {
            dict.put(formal_arguments.head(),
                     actual_arguments.head());
            formal_arguments = formal_arguments.tail();
            actual_arguments = actual_arguments.tail();
        }
        if (formal_arguments.isNotNull()) {
            return new Closure(formal_arguments,
                               this.body,
                               this.env.extend(dict));
        } else {
            return this.body.eval(this.env.extend(dict));
        }
    }
}
Note that the dictionary that we have been building is used to extend the
environment of the new closure with the values we know already,
and that the formal_arguments
we've been chaining
down is now precisely the remaining arguments that we haven't
seen yet.
Of course this allows for silly definitions like:
fn add (x, y) { x + y }
def anotheradd = add(,);
But presumably our type checker (if not our grammar) would disallow that sort of thing, because there's nothing to put a trailing comma after.
[Edit] You could alternatively add a guard clause to apply()
that says if this closure is expecting arguments and doesn't get
any, just return the original closure.
That way, something like:
add()()(2)()()(3)
while still silly, would at least not be unnecessarily inefficient.
So I got the above working in F♮ easily enough, then I noticed an anomaly. The type of:
fn adder(x) { fn(y) { x + y } }
is definitely int -> int -> int
, which means that the
type checker is allowing it to be called like: adder(2, 3)
.
Why can't I call it like that? It turns out I can:
class Closure {
    private List<Symbol> fargs;
    private Env env;
    private Expr body;

    Expr apply(List<Expr> actual_arguments) {
        Dictionary<Symbol, Expr> dict;
        List<Symbol> formal_arguments = this.fargs;
        while (formal_arguments.isNotNull() &&
               actual_arguments.isNotNull()) {
            dict.put(formal_arguments.head(),
                     actual_arguments.head());
            formal_arguments = formal_arguments.tail();
            actual_arguments = actual_arguments.tail();
        }
        if (formal_arguments.isNotNull()) {
            return new Closure(formal_arguments,
                               this.body,
                               this.env.extend(dict));
        } else if (actual_arguments.isNotNull()) {
            return this.body
                       .eval(this.env.extend(dict))
                       .apply(actual_arguments);
        } else {
            return this.body.eval(this.env.extend(dict));
        }
    }
}
Assuming the type checker has done its job, then if we have
any actual arguments left over then they must be destined for the
function that must be the result of evaluating the body. So instead
of just evaluating the body in the new env, we additionally call
apply()
on the result, passing in the left-over
arguments.
This is pretty cool. We can have:
fn adder(x) { fn (y) { x + y } }
and call it like:
adder(2, 3)
or:
adder(2)(3)
and we can have:
fn add(x, y) { x + y }
and call it like:
add(2, 3)
or:
add(2)(3)
The question arises: if we have implicit Currying (partial function application), then do we need explicit Currying (explicitly returning a function from a function)? The answer is a resounding yes! Consider:
fn bigfn (a) {
    if (expensive_calculation(a)) {
        fn (b) { cheap_operation_1(b) }
    } else {
        fn (b) { cheap_operation_2(b) }
    }
}

map(bigfn(x), long_list);
We've only called bigfn
once, when evaluating the
first argument to map
, so expensive_calculation
only got called once, and the resulting anonymous closure calling either
cheap_operation_1
or cheap_operation_2
gets
called on each element of the list.
If instead we had written:
fn bigfn (a, b) {
    if (expensive_calculation(a)) {
        cheap_operation_1(b)
    } else {
        cheap_operation_2(b)
    }
}

map(bigfn(x,), long_list);
Then the call to expensive_calculation
would get
deferred until the map
actually called its argument
function, repeatedly, for each element of the long_list
.
What may not be clear to readers in a lot of the above discussions is the use of Algebraic Data Types in combination with pattern matching to define functions. It's really quite simple, conceptually (implementation may be a different matter, we'll see.) Here's an example we've seen before; I'll just be more descriptive this time:
typedef list(t) { cons(t, list(t)) | null }
This declaration achieves two things:

- It defines a new type list(t) (list of t), where t is a type variable that can stand for any type.
- It defines the type constructors cons and null, that accept arguments of the specified types (none in the case of null) and return data of type list(t).

Reading it aloud, it says: define a type list of some unspecified type t, which is either a cons of a t and a list of t, or a null.
Once defined, we use these type constructors to create lists of a concrete type:
def a = cons(true, cons(false, cons(false, null)));
After the above definition, a
has type list(bool)
.
The following, on the other hand, would fail to type check:
def a = cons(true, cons('x', null));
It fails because:

- cons('x', null) is of type list(char).
- cons expects arguments <t> and list(<t>), but it gets bool and list(char).
- cons cannot reconcile <t> = bool with <t> = char, so the type check fails.

That's all very nice, but how can we use Algebraic Data Types? It turns out that they become very useful in combination with pattern matching in case statements. Consider:
fn length(a) {
    switch(a) {
        case (cons(head, tail)) { 1 + length(tail) }
        case (null) { 0 }
    }
}
In that case statement, a
must match either
cons(head, tail)
or null
. Now if it
matches cons(head, tail)
, the (normal) variables
head
and tail
are automatically created
and instantiated as the relevant components of the cons
in the body of the case statement. This kind of behaviour is so
commonplace in languages like ML that special syntax for functions
has evolved, which I'm borrowing for F♮:
fn length {
    (cons(head, tail)) { 1 + length(tail) }
    (null) { 0 }
}
This version of length
, instead of having a single
formal argument list outside the body, has alternative formal argument
lists inside the body, with mini bodies of their own, just like a case
statement. It's functionally identical to the previous version, but
a good deal more concise and readable.
One thing to bear in mind, in both versions, is that length
has type list(t) -> int
. That is to say, each of the
formal argument lists inside the body of a function, or the alternative
cases in a case statement, must agree in the number and types of the
arguments, and must return the same type of result.
Now, it becomes obvious that, just as we can rewrite a let
to be a lambda
, this case statement is in fact just
syntactic sugar for an anonymous function call. The earlier definition
of length
above, using a case statement, can be re-written
as:
fn length(a) {
    fn {
        (cons(head, tail)) { 1 + length(tail) }
        (null) { 0 }
    }(a)
}
so we get case statements: powerful, pattern matching ones, allowing more than one argument, for free if we take this approach.
length
is polymorphic. It does
not do anything to the value of head
so does not care
about its type. Therefore the type of length
, namely
list(t) -> int
actually contains a type variable
t
.
Here's a function that does care about the type of the list:
fn sum_strlen {
    (cons(head, tail)) { strlen(head) + sum_strlen(tail) }
    (null) { 0 }
}
Assuming strlen
has type string -> int
,
that would constrain sum_strlen
to have type
list(string) -> int
. Of course that's a rather silly
function, we would be better passing in a function like this:
fn sum {
    (fn, cons(h, t)) { fn(h) + sum(fn, t) }
    (fn, null) { 0 }
}
That would give sum
a type:
(t -> int) -> list(t) -> int
and we could call it like:
def a = cons("hello", cons(" ", cons("there", null)));
def len = sum(strlen, a);
or even, with a Curried application:
def sum_strlen = sum(strlen);
This is starting to look like map-reduce. More on that later.
Algebraic Data Types really come in to their own when it comes to tree walking. Consider the following definitions:
typedef expr(t) {
    value(t) |
    add(expr(t), expr(t)) |
    sub(expr(t), expr(t)) |
    mul(expr(t), expr(t)) |
    div(expr(t), expr(t))
}
Given that, we can write an evaluator for arithmetic expressions very easily:
fn eval {
    (value(i)) { i }
    (add(l, r)) { eval(l) + eval(r) }
    (sub(l, r)) { eval(l) - eval(r) }
    (mul(l, r)) { eval(l) * eval(r) }
    (div(l, r)) { eval(l) / eval(r) }
}
So eval
has type expr(int) -> int
.
We can call it like:
eval(add(value(2), mul(value(3), value(5))));
to get 17.
Pattern matching not only covers variables and type constructors,
it can also cope with constants. For example here's a definition of
factorial
:
fn factorial {
    (0) { 1 }
    (n) { n * factorial(n - 1) }
}
For this and other examples to work, the cases must be checked in order and the first case that matches is selected, so the argument to
factorial
would only match n
if it failed
to match 0
.
As another example, here's member
:
fn member {
    (item, item @ tail) { true }
    (item, _ @ tail) { member(item, tail) }
    (item, []) { false }
}
Here I'm using F♮'s built-in list type constructors @ (pronounced cons) and [] (pronounced null), and a wildcard _ to indicate a don't care variable that always unifies, but apart from that it's just the same as the cons and null constructors. Anyway, the cases say:
- member(item, list) is true if item is at the head of the list.
- member(item, list) is true if item is a member of the tail of the list.
- item is not a member of the empty list.

You've probably realised that given a type like list(t)
above, it's not possible to directly create lists of mixed type.
That is because it is usually a very bad idea to do so. However if you need
to do so, you can get around the restriction without breaking any rules,
as follows:
typedef either(t, u) { first(t) | second(u) }
def a = [ first("a string") , second(20) ]
After the above definition, a
has type
list(either(string, int))
, and you can't get at the data
without knowing its type:
fn sum_numbers {
    (first(_) @ tail) { sum_numbers(tail) }
    (second(n) @ tail) { n + sum_numbers(tail) }
    ([]) { 0 }
}
Here, sum_numbers has type [either(<t>, int)] -> int, i.e. it doesn't care what type first holds. We could have written first(s) instead of first(_), but the use of a wildcard _ explicitly says we don't care, stops any potential warnings about unused variables, and is more efficient.
I'm actually quite annoyed, for once. I remember reading a completely lucid description of Dependency Injection some time ago, but recently I've done a brief search of the web for documents on the subject and they're unanimously impenetrable, at least for someone with my attention span. So here's my explanation of DI Catalogues in as few words as I can.
Firstly we need a catalogue:
package Catalogue;

sub new { bless {}, $_[0] }

sub get {
    my ($self, $thing) = @_;
    $self->{$thing}->($self);
}
Next we need to populate it:
my $catalogue = new Catalogue();

$catalogue->{Foo} = sub {
    my ($c) = @_;
    new Foo($c->get('Bar'), $c->get('Baz'));
};

$catalogue->{Bar} = sub { new Bar() };
$catalogue->{Baz} = sub { new Baz() };
Finally we get to use it:
my $foo = $catalogue->get('Foo');
That is all there is to it! Of course this omits all error checking, but you can add that yourself once you understand the principles.